Mining Recurring Concept Drifts with Limited Labeled Streaming Data

Tracking recurring concept drifts is a significant problem in machine learning and data mining, and it arises frequently in real-world stream classification. Many streaming classification algorithms struggle to learn recurring concepts in a data stream environment that contains unlabeled data, and this challenge has received little attention from the research community. Motivated by this gap, this article focuses on the problem of recurring contexts in streaming environments with limited labeled data. We propose REDLLA, a semi-supervised classification algorithm for data streams with REcurring concept Drifts and Limited LAbeled data, in which a decision tree is adopted as the classification model. As the tree grows, a clustering algorithm based on k-means is applied to produce concept clusters, and unlabeled data are labeled with the majority class at the leaves. Potential concept drifts are distinguished, and recurring concepts are maintained, according to the deviations between historical and new concept clusters. Extensive studies on both synthetic and real-world data confirm the advantages of REDLLA over three state-of-the-art online classification algorithms (CVFDT, DWCDS, and CDRDT) and over several known online semi-supervised algorithms, even when more than 90% of the data are unlabeled.
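The two ideas named in the abstract, majority-class labeling inside k-means clusters at a leaf and comparison of new concept clusters against historical ones, can be illustrated with a minimal sketch. This is not the REDLLA implementation: the function names (`kmeans`, `label_leaf`, `match_concept`), the deterministic initialization from the first k points, the leaf-wide fallback for clusters without labeled instances, and the centroid-distance threshold used as the "deviation" are all assumptions made for illustration.

```python
from collections import Counter
from math import dist


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means. Initial centroids are the first k points
    (an assumption chosen only to keep the sketch deterministic)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    # Final assignment against the final centroids.
    assign = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
    return centroids, assign


def label_leaf(points, labels, k):
    """Fill in None labels with the majority class of the labeled
    instances in the same cluster (hypothetical helper). At least one
    instance at the leaf must be labeled."""
    centroids, assign = kmeans(points, k)
    leaf_majority = Counter(l for l in labels if l is not None).most_common(1)[0][0]
    filled = list(labels)
    for c in range(k):
        in_cluster = [i for i, a in enumerate(assign) if a == c]
        known = Counter(labels[i] for i in in_cluster if labels[i] is not None)
        # Assumed fallback: a cluster with no labeled instance uses the leaf majority.
        majority = known.most_common(1)[0][0] if known else leaf_majority
        for i in in_cluster:
            if filled[i] is None:
                filled[i] = majority
    return centroids, filled


def match_concept(new_centroid, history, threshold):
    """Return the index of the nearest historical centroid if it lies within
    `threshold` (treated here as a recurring concept), else None (treated as
    a potentially new concept). The distance test is an assumed stand-in for
    the deviation measure."""
    if not history:
        return None
    best = min(range(len(history)), key=lambda i: dist(new_centroid, history[i]))
    return best if dist(new_centroid, history[best]) <= threshold else None


points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
labels = ["a", "b", None, None, None, None]  # only two labeled instances
centroids, filled = label_leaf(points, labels, k=2)
```

In this toy run, each of the four unlabeled points inherits the label of the single labeled point in its cluster. A centroid produced for a later data chunk could then be passed to `match_concept` together with the stored centroids to decide whether to reuse a stored concept or record a new one.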
