Full Text:   <2808>

CLC number: TP391

On-line Access: 2011-08-03

Received: 2010-09-25

Revision Accepted: 2011-03-09

Crosschecked: 2011-07-04

Cited: 4

Clicked: 3567

Citations:  Bibtex RefMan EndNote GB/T7714

-   Go to

Article info.
1. Reference List
Open peer comments

Journal of Zhejiang University SCIENCE C 2011 Vol.12 No.8 P.615-628

10.1631/jzus.C1000330


Clustering feature decision trees for semi-supervised classification from high-speed data streams


Author(s):  Wen-hua Xu, Zheng Qin, Yang Chang

Affiliation(s):  Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China, School of Software, Tsinghua University, Beijing 100084, China

Corresponding email(s):   xwh07@mails.tsinghua.edu.cn, zhqing@mails.tsinghua.edu.cn

Key Words:  Clustering feature vector, Decision tree, Semi-supervised learning, Stream data classification, Very fast decision tree


Share this article to: More |Next Article >>>

Wen-hua Xu, Zheng Qin, Yang Chang. Clustering feature decision trees for semi-supervised classification from high-speed data streams[J]. Journal of Zhejiang University Science C, 2011, 12(8): 615-628.

@article{title="Clustering feature decision trees for semi-supervised classification from high-speed data streams",
author="Wen-hua Xu, Zheng Qin, Yang Chang",
journal="Journal of Zhejiang University Science C",
volume="12",
number="8",
pages="615-628",
year="2011",
publisher="Zhejiang University Press & Springer",
doi="10.1631/jzus.C1000330"
}

%0 Journal Article
%T Clustering feature decision trees for semi-supervised classification from high-speed data streams
%A Wen-hua Xu
%A Zheng Qin
%A Yang Chang
%J Journal of Zhejiang University SCIENCE C
%V 12
%N 8
%P 615-628
%@ 1869-1951
%D 2011
%I Zhejiang University Press & Springer
%DOI 10.1631/jzus.C1000330

TY - JOUR
T1 - Clustering feature decision trees for semi-supervised classification from high-speed data streams
A1 - Wen-hua Xu
A1 - Zheng Qin
A1 - Yang Chang
J0 - Journal of Zhejiang University Science C
VL - 12
IS - 8
SP - 615
EP - 628
%@ 1869-1951
Y1 - 2011
PB - Zhejiang University Press & Springer
ER -
DOI - 10.1631/jzus.C1000330


Abstract: 
Most stream data classification algorithms apply the supervised learning strategy which requires massive labeled data. Such approaches are impractical since labeled data are usually hard to obtain in reality. In this paper, we build a clustering feature decision tree model, CFDT, from data streams having both unlabeled and a small number of labeled examples. CFDT applies a micro-clustering algorithm that scans the data only once to provide the statistical summaries of the data for incremental decision tree induction. Micro-clusters also serve as classifiers in tree leaves to improve classification accuracy and reinforce the any-time property. Our experiments on synthetic and real-world datasets show that CFDT is highly scalable for data streams while generating high classification accuracy with high speed.

Darkslateblue:Affiliate; Royal Blue:Author; Turquoise:Article

Reference

[1]Bifet, A., Kirkby, R., Holmes, G., Pfahringer, B., 2007. MOA: Massive Online Analysis. Available from http://moa.cs.waikato.ac.nz/ [Accessed on Jan. 31, 2010].

[2]Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., Gavalda, R., 2009. New Ensemble Methods for Evolving Data Streams. Proc. 15th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.139-148.

[3]Bifet, A., Holmes, G., Pfahringer, B., Frank, E., 2010. Fast perceptron decision tree learning from evolving data streams. LNCS, 6119:299-310.

[4]Chapelle, O., Scholkopf, B., Zien, A., 2006. Semi-supervised Learning. MIT Press, Cambridge, USA, p.5.

[5]Domingos, P., Hulten, G., 2000. Mining High-Speed Data Streams. Proc. 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.71-80.

[6]Gama, J., Rocha, R., Medas, P., 2003. Accurate Decision Trees for Mining High-Speed Data Streams. Proc. 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.523-528.

[7]Gehrke, J., Ganti, V., Ramakrishnan, R., Loh, W., 1999. BOAT—Optimistic Decision Tree Construction. Proc. ACM SIGMOD Int. Conf. on Management of Data, p.169-180.

[8]Gehrke, J., Ramakrishnan, R., Ganti, V., 2000. RainForest—a framework for fast decision tree construction of large datasets. Data Min. Knowl. Disc., 4(2/3):127-162.

[9]Greenwald, M., Khanna, S., 2001. Space-Efficient Online Computation of Quantile Summaries. Proc. ACM SIGMOD Int. Conf. on Management of Data, p.58-66.

[10]Hulten, G., Domingos, P., 2003. VFML—a Toolkit for Mining High-Speed Time-Changing Data Streams. Available from http://www.cs.washington.edu/dm/vfml [Accessed on Apr. 25, 2010].

[11]Hulten, G., Spencer, L., Domingos, P., 2001. Mining Time-Changing Data Streams. Proc. 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.97-106.

[12]Jin, W., Tung, A.K.H., Han, J., 2001. Mining Top-n Local Outliers in Large Databases. Proc. 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.293-298.

[13]Li, P., Wu, X., Hu, X., 2010. Learning from Concept Drifting Data Streams with Unlabeled Data. Proc. 24th AAAI Conf. on Artificial Intelligence, p.1495-1496.

[14]Masud, M.M., Gao, J., Khan, L., Han, J., Thuraisingham, B., 2008. A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data. Proc. 8th IEEE Int. Conf. on Data Mining, p.929-934.

[15]Mehta, M., Agrawal, R., Rissanen, J., 1996. SLIQ: a fast scalable classifier for data mining. LNCS, 1057:18-32.

[16]Pfahringer, B., Holmes, G., Kirkby, R., 2007. New options for Hoeffding trees. LNCS, 4830:90-99.

[17]Pfahringer, B., Holmes, G., Krikby, R., 2008. Handling Numeric Attributes in Hoeffding Trees. Proc. 12th Pacific-Asia Conf. on Knowledge Discovery and Data Mining, p.296-307.

[18]Shafer, J.C., Agrawal, R., Mehta, M., 1996. SPRINT: a Scalable Parallel Classifier for Data Mining. Proc. 22nd Int. Conf. on Very Large Data Bases, p.544-555.

[19]Street, W.N., Kim, Y., 2001. A Streaming Ensemble Algorithm (SEA) for Large-Scale Classification. Proc. 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.377-382.

[20]Wang, H., Fan, W., Yu, P.S., Han, J., 2003. Mining Concept-Drifting Data Streams Using Ensemble Classifiers. Proc. 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.226-235.

[21]Wu, S., Yang, C., Zhou, J., 2006. Clustering-Training for Data Stream Mining. Proc. 6th IEEE Int. Conf. on Data Mining, p.653-656.

[22]Yu, H., Yang, J., Han, J., 2003. Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Proc. 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.306-315.

[23]Zhang, T., Ramakrishnan, R., Livny, M., 1996. BIRCH: an Efficient Data Clustering Method for Very Large Databases. Proc. ACM SIGMOD Int. Conf. on Management of Data, p.103-114.

Open peer comments: Debate/Discuss/Question/Opinion

<1>

Please provide your name, email address and a comment





Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - Journal of Zhejiang University-SCIENCE