Lacking Labels in the Stream: Classifying Evolving Stream Data with Few Labels

This paper outlines a data stream classification technique that addresses the problem of insufficient and biased labeled data. It is practical to assume that only a small fraction of instances in the stream are labeled. A more practical assumption would be that the labeled data may not be independently distributed among all training documents. How can we ensure that a good classification model would be built in these scenarios, considering that the data stream also has evolving nature? In our previous work we applied semi-supervised clustering to build classification models using limited amount of labeled training data. However, it assumed that the data to be labeled should be chosen randomly. In our current work, we relax this assumption, and propose a label propagation framework for data streams that can build good classification models even if the data are not labeled randomly. Comparison with state-of-the-art stream classification techniques on synthetic and benchmark real data proves the effectiveness of our approach.

[1]  Marcus A. Maloof,et al.  Using additive expert ensembles to cope with concept drift , 2005, ICML.

[2]  Philip S. Yu,et al.  Stop Chasing Trends: Discovering High Order Models in Evolving Data , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[3]  Alexander Zien,et al.  Semi-Supervised Learning , 2006 .

[4]  Latifur Khan,et al.  Multi-label large margin hierarchical perceptron , 2008, Int. J. Data Min. Model. Manag..

[5]  Philip S. Yu,et al.  Mining concept-drifting data streams using ensemble classifiers , 2003, KDD '03.

[6]  Wei Fan,et al.  Systematic data selection to mine concept-drifting data streams , 2004, KDD.

[7]  Ramakrishnan Srikant,et al.  Kdd-2001: Proceedings of the Seventh Acm Sigkdd International Conference on Knowledge Discovery and Data Mining : August 26-29, 2001 San Francisco, Ca, USA , 2002 .

[8]  Alexander Zien,et al.  Label Propagation and Quadratic Criterion , 2006 .

[9]  Jiawei Han,et al.  On Appropriate Assumptions to Mine Data Streams: Analysis and Practice , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[10]  Ralf Klinkenberg,et al.  An Ensemble Classifier for Drifting Concepts , 2005 .

[11]  Geoff Hulten,et al.  Mining high-speed data streams , 2000, KDD '00.

[12]  Geoff Hulten,et al.  Mining time-changing data streams , 2001, KDD '01.

[13]  Xindong Wu,et al.  Combining proactive and reactive predictions for data streams , 2005, KDD '05.

[14]  Bhavani M. Thuraisingham,et al.  A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[15]  Philip S. Yu,et al.  A framework for on-demand classification of evolving data streams , 2006, IEEE Transactions on Knowledge and Data Engineering.