Efficient Active Novel Class Detection for Data Stream Classification

An important aspect of data stream classification is the possible appearance of novel, previously unseen classes, which must be identified to avoid confusing them with existing classes. Most existing techniques do not detect such new classes, and the problem is rarely addressed in the literature. We address this issue and propose an efficient method for identifying the emergence of novel classes in a multi-class data stream. The proposed method incrementally maintains a covered feature space of the existing (known) classes. An incoming data point is designated as an "insider" or an "outsider" depending on whether it lies inside or outside the covered region. An insider is a possible instance of an existing class, while an outsider may be an instance of a novel class. The method iteratively selects those insiders (resp. outsiders) that are more likely to be members of a novel (resp. an existing) class, and eventually distinguishes the actual novel and existing classes accurately. We show how to actively query the labels of the most uncertain instances among those identified as belonging to a novel class. The method also allows us to balance the speed of novelty detection against its efficiency. Experiments on real-world data demonstrate the effectiveness of our approach for both novel class detection and classification accuracy.
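
The insider/outsider designation can be illustrated with a minimal sketch. The abstract does not specify how the covered feature space is represented, so the version below assumes, purely for illustration, that it is a union of fixed-radius balls around known-class points. The class name `CoveredSpace`, the `radius` parameter, and the sample points are hypothetical and are not taken from the paper.

```python
import math


class CoveredSpace:
    """Toy covered region (assumption): union of fixed-radius balls
    around the known-class points seen so far."""

    def __init__(self, radius):
        self.radius = radius
        self.centers = []  # points of known classes observed so far

    def add(self, point):
        # Incrementally extend the covered region with a known-class point.
        self.centers.append(tuple(point))

    def designate(self, point):
        # "insider" if the point lies within `radius` of any stored center,
        # otherwise "outsider".
        for center in self.centers:
            if math.dist(center, point) <= self.radius:
                return "insider"
        return "outsider"


space = CoveredSpace(radius=1.0)
for known in [(0.0, 0.0), (0.5, 0.5), (5.0, 5.0)]:
    space.add(known)

stream = [(0.2, 0.1), (5.3, 4.8), (10.0, 10.0)]
labels = [space.designate(x) for x in stream]
print(labels)  # ['insider', 'insider', 'outsider']
```

This sketch covers only the first designation step. The iterative re-selection of insiders and outsiders and the active label querying described above are not modeled here.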
