Classification and Novel Class Detection in Data Streams with Active Mining

We present ActMiner, which addresses four major challenges to data stream classification: infinite length, concept-drift, concept-evolution, and limited labeled data. Most existing data stream classification techniques address only the infinite length and concept-drift problems. Our previous work, MineClass, addresses the concept-evolution problem in addition to the infinite length and concept-drift problems. Concept-evolution occurs in the stream when novel classes arrive. However, most existing data stream classification techniques, including MineClass, require that all the instances in a data stream be labeled by human experts and become available for training. This assumption is impractical, since data labeling is both time consuming and costly. Therefore, it is impossible to label a majority of the data points in a high-speed data stream. This scarcity of labeled data naturally leads to poorly trained classifiers. ActMiner actively selects for labeling only those data points for which the expected classification error is high. ActMiner thus extends MineClass and addresses the limited labeled data problem in addition to the other three problems. It outperforms state-of-the-art data stream classification techniques that use ten times or more labeled data than ActMiner.
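The idea of requesting labels only where the expected classification error is high can be illustrated with a minimal sketch. It is not the ActMiner selection criterion, which the abstract does not specify. It assumes an ensemble of classifiers and uses disagreement among their votes as a stand-in for expected error. All names here (`vote_agreement`, `select_for_labeling`, `make_threshold_classifier`, `min_agreement`) are hypothetical.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence

# Assumption: a classifier is any callable mapping a feature vector to a label.
Classifier = Callable[[Sequence[float]], Hashable]


def vote_agreement(ensemble: Sequence[Classifier], x: Sequence[float]) -> float:
    """Fraction of ensemble members that vote for the most popular label."""
    votes = Counter(clf(x) for clf in ensemble)
    top_count = votes.most_common(1)[0][1]
    return top_count / len(ensemble)


def select_for_labeling(
    ensemble: Sequence[Classifier],
    chunk: Sequence[Sequence[float]],
    min_agreement: float = 1.0,
) -> List[int]:
    """Return indices of instances whose vote agreement is below min_agreement.

    Low agreement is used here as a proxy for high expected error; only
    these instances would be sent to a human expert for labeling.
    """
    return [
        i for i, x in enumerate(chunk)
        if vote_agreement(ensemble, x) < min_agreement
    ]


def make_threshold_classifier(threshold: float) -> Classifier:
    """Toy one-dimensional classifier: label 1 above the threshold, else 0."""
    return lambda x: 1 if x[0] > threshold else 0


# Toy ensemble of three classifiers with different decision thresholds.
ensemble = [make_threshold_classifier(t) for t in (0.3, 0.5, 0.7)]
chunk = [[0.1], [0.4], [0.9], [0.6]]

# Only the instances on which the ensemble disagrees are selected.
to_label = select_for_labeling(ensemble, chunk)
print(to_label)
```

In this toy chunk the ensemble is unanimous on the first and third instances, so only the second and fourth would be forwarded for labeling; the remaining instances are classified without requesting a label.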
