Facing the reality of data stream classification: coping with scarcity of labeled data

Recent approaches for classifying data streams are mostly based on supervised learning algorithms, which can only be trained with labeled data. Manual labeling of data is both costly and time consuming. Therefore, in a real streaming environment where large volumes of data appear at a high speed, only a small fraction of the data can be labeled. Thus, only a limited number of instances will be available for training and updating the classification models, leading to poorly trained classifiers. We apply a novel technique to overcome this problem by utilizing both unlabeled and labeled instances to train and update the classification model. Each classification model is built as a collection of micro-clusters using semi-supervised clustering, and an ensemble of these models is used to classify unlabeled data. Empirical evaluation of both synthetic and real data reveals that our approach outperforms state-of-the-art stream classification algorithms that use ten times more labeled data than our approach.

[1]  Philip S. Yu,et al.  Mining concept-drifting data streams using ensemble classifiers , 2003, KDD '03.

[2]  Bhavani M. Thuraisingham,et al.  A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[3]  E. G. Lyman,et al.  NASA aviation safety reporting system , 1976 .

[4]  Kagan Tumer,et al.  Error Correlation and Error Reduction in Ensemble Classifiers , 1996, Connect. Sci..

[5]  Raghu Ramakrishnan,et al.  Proceedings : KDD 2000 : the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 20-23, 2000, Boston, MA, USA , 2000 .

[6]  Alexander Zien,et al.  Semi-Supervised Learning , 2006 .

[7]  Jiawei Han,et al.  On Appropriate Assumptions to Mine Data Streams: Analysis and Practice , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[8]  Xindong Wu,et al.  Effective classification of noisy data streams with attribute-oriented dynamic classifier selection , 2006, Knowledge and Information Systems.

[9]  Xiaodong Lin,et al.  Active Learning from Data Streams , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[10]  Philip S. Yu,et al.  One-class learning and concept summarization for data streams , 2011, Knowledge and Information Systems.

[11]  Ira Assent,et al.  The ClusTree: indexing micro-clusters for anytime stream mining , 2011, Knowledge and Information Systems.

[12]  Grigorios Tsoumakas,et al.  Tracking recurring contexts using ensemble classifiers: an application to email filtering , 2009, Knowledge and Information Systems.

[13]  J. Besag On the Statistical Analysis of Dirty Pictures , 1986 .

[14]  Philip S. Yu,et al.  Positive Unlabeled Learning for Data Stream Classification , 2009, SDM.

[15]  Philip S. Yu,et al.  A framework for on-demand classification of evolving data streams , 2006, IEEE Transactions on Knowledge and Data Engineering.

[16]  Dimitrios Gunopulos,et al.  A framework for semi-supervised learning based on subjective and objective clustering criteria , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[17]  Ramakrishnan Srikant,et al.  Kdd-2001: Proceedings of the Seventh Acm Sigkdd International Conference on Knowledge Discovery and Data Mining : August 26-29, 2001 San Francisco, Ca, USA , 2002 .

[18]  Xuegang Hu,et al.  Learning from concept drifting data streams with unlabeled data , 2012, Neurocomputing.

[19]  Philip S. Yu,et al.  Under Consideration for Publication in Knowledge and Information Systems on Clustering Massive Text and Categorical Data Streams , 2022 .

[20]  Wei Fan,et al.  Systematic data selection to mine concept-drifting data streams , 2004, KDD.

[21]  Raymond J. Mooney,et al.  Integrating constraints and metric learning in semi-supervised clustering , 2004, ICML.

[22]  Ludmila I. Kuncheva,et al.  Nearest Neighbour Classifiers for Streaming Data with Delayed Labelling , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[23]  Marcus A. Maloof,et al.  Using additive expert ensembles to cope with concept drift , 2005, ICML.

[24]  Ralf Klinkenberg,et al.  An Ensemble Classifier for Drifting Concepts , 2005 .

[25]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[26]  Ayhan Demiriz,et al.  Semi-Supervised Clustering Using Genetic Algorithms , 1999 .

[27]  Franco Turini,et al.  Stream mining: a novel architecture for ensemble-based classification , 2011, Knowledge and Information Systems.

[28]  Paul E. Utgoff,et al.  Incremental Induction of Decision Trees , 1989, Machine Learning.

[29]  Arindam Banerjee,et al.  Active Semi-Supervision for Pairwise Constrained Clustering , 2004, SDM.

[30]  Xindong Wu,et al.  Combining proactive and reactive predictions for data streams , 2005, KDD '05.

[31]  Latifur Khan,et al.  Lacking Labels in the Stream: Classifying Evolving Stream Data with Few Labels , 2009, ISMIS.

[32]  Geoff Hulten,et al.  Mining time-changing data streams , 2001, KDD '01.

[33]  Bernhard Schölkopf,et al.  Learning with Local and Global Consistency , 2003, NIPS.

[34]  Michael I. Jordan,et al.  Distance Metric Learning with Application to Clustering with Side-Information , 2002, NIPS.

[35]  Andrew McCallum,et al.  Semi-Supervised Clustering with User Feedback , 2003 .

[36]  Johannes Gehrke,et al.  BOAT—optimistic decision tree construction , 1999, SIGMOD '99.

[37]  Charu C. Aggarwal On classification and segmentation of massive audio data streams , 2008, Knowledge and Information Systems.

[38]  Philip S. Yu,et al.  Stop Chasing Trends: Discovering High Order Models in Evolving Data , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[39]  Arindam Banerjee,et al.  Semi-supervised Clustering by Seeding , 2002, ICML.

[40]  Philip S. Yu,et al.  Active Mining of Data Streams , 2004, SDM.

[41]  Alexander Zien,et al.  Label Propagation and Quadratic Criterion , 2006 .

[42]  Gerhard B van Huyssteen,et al.  Using Machine Learning to Annotate Data for NLP Tasks Semi-Automatically , 2007 .

[43]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[44]  Alexander Zien,et al.  Probabilistic Semi-Supervised Clustering with Constraints , 2006 .

[45]  Arindam Banerjee,et al.  Probabilistic Semi-Supervised Clustering with Constraints , 2006, Semi-Supervised Learning.

[46]  Claire Cardie,et al.  Proceedings of the Eighteenth International Conference on Machine Learning, 2001, p. 577–584. Constrained K-means Clustering with Background Knowledge , 2022 .

[47]  David B. Shmoys,et al.  A Best Possible Heuristic for the k-Center Problem , 1985, Math. Oper. Res..

[48]  Aoying Zhou,et al.  Tracking clusters in evolving data streams over sliding windows , 2008, Knowledge and Information Systems.

[49]  Geoff Hulten,et al.  Mining high-speed data streams , 2000, KDD '00.

[50]  Dan Klein,et al.  From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering , 2002, ICML.

[51]  Marcus A. Maloof,et al.  Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts , 2007, J. Mach. Learn. Res..

[52]  Bhavani M. Thuraisingham,et al.  Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams , 2009, ECML/PKDD.