Learning from concept drifting data streams with unlabeled data

Most existing work on data stream classification assumes that all streaming data are labeled and that class labels are immediately available. In real-world applications such as credit fraud and intrusion detection, however, this assumption does not always hold, which makes learning from concept drifting data streams with unlabeled data a challenge. With this motivation, we propose SUN, a Semi-supervised classification algorithm for data streams with concept drifts and UNlabeled data. In SUN, a clustering algorithm developed from k-Modes produces concept clusters at the leaves of an incremental decision tree. Potential concept drifts are distinguished from noise according to the deviations between historical concept clusters and new ones. Extensive studies on both synthetic and real-world data demonstrate that SUN performs well compared to several state-of-the-art online supervised and semi-supervised algorithms, even when more than 90% of the data are unlabeled. We therefore conclude that SUN provides a promising framework for tackling concept drifting data streams with unlabeled data.
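
The idea of comparing historical and new concept clusters can be illustrated with a minimal sketch. It is not the SUN algorithm itself: it runs a plain k-Modes pass over two batches of categorical records and measures how far the new cluster modes lie from the old ones. The function names, the deviation measure (mean normalised Hamming distance to the nearest old mode), and the two thresholds `stable_max` and `drift_min` are all assumptions made for illustration and do not come from the paper.

```python
from collections import Counter


def hamming(a, b):
    """Number of attribute positions where two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_of(rows):
    """Per-attribute most frequent value of a non-empty list of records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(rows, k, iters=10):
    """Plain k-Modes: initialise with the first k distinct records,
    then alternate assignment and mode update until the modes stop changing."""
    modes = []
    for r in rows:
        if r not in modes:
            modes.append(r)
        if len(modes) == k:
            break
    for _ in range(iters):
        clusters = [[] for _ in modes]
        for r in rows:
            nearest = min(range(len(modes)), key=lambda i: hamming(r, modes[i]))
            clusters[nearest].append(r)
        # Keep the old mode when a cluster ends up empty.
        new_modes = [mode_of(c) if c else m for c, m in zip(clusters, modes)]
        if new_modes == modes:
            break
        modes = new_modes
    return modes


def deviation(old_modes, new_modes):
    """Mean normalised Hamming distance from each new mode to its nearest old mode."""
    n_attrs = len(new_modes[0])
    total = sum(min(hamming(n, o) for o in old_modes) for n in new_modes)
    return total / (len(new_modes) * n_attrs)


def classify_change(old_modes, new_modes, stable_max=0.1, drift_min=0.5):
    """Hypothetical two-threshold rule: small deviations count as stable,
    moderate ones as noise, large ones as potential concept drift."""
    d = deviation(old_modes, new_modes)
    if d <= stable_max:
        return "stable"
    if d >= drift_min:
        return "drift"
    return "noise"


# Usage: one historical batch compared against two new batches.
history = [("a", "x", "p")] * 3 + [("b", "y", "q")] * 3
slightly_changed = [("a", "x", "p")] * 3 + [("b", "y", "r")] * 3
very_different = [("c", "z", "r")] * 3 + [("d", "w", "s")] * 3

old = k_modes(history, k=2)
print(classify_change(old, k_modes(slightly_changed, k=2)))
print(classify_change(old, k_modes(very_different, k=2)))
```

In this sketch no labels are needed to flag a change, which is the property that matters when most of the stream is unlabeled. A real implementation would maintain the clusters incrementally at each tree leaf rather than re-clustering whole batches.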
