Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints

Most existing data stream classification techniques ignore one important aspect of stream data: the arrival of a novel class. We address this issue and propose a data stream classification technique that integrates a novel class detection mechanism into traditional classifiers, enabling automatic detection of novel classes before the true labels of the novel class instances arrive. The novel class detection problem becomes more challenging in the presence of concept drift, when the underlying data distributions evolve in streams. To determine whether an instance belongs to a novel class, the classification model sometimes needs to wait for more test instances to discover similarities among those instances. A maximum allowable wait time Tc is imposed as a time constraint on classifying a test instance. Furthermore, most existing stream classification approaches assume that the true label of a data point can be accessed immediately after the data point is classified. In reality, a time delay Tl is involved in obtaining the true label of a data point, since manual labeling is time consuming. We show how to make fast and correct classification decisions under these constraints and apply them to real benchmark data. Comparison with state-of-the-art stream classification techniques proves the superiority of our approach.
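The wait-time constraint described above can be illustrated with a minimal sketch. It is not the method of the paper: it assumes time is counted in arriving instances, and the names `run_stream`, `classify`, `is_outlier`, and the threshold `q` (how many buffered outliers trigger a "novel" decision) are hypothetical placeholders. The label delay Tl is not modeled here.

```python
from collections import deque


def run_stream(xs, classify, is_outlier, t_c, q):
    """Classify a stream so that no instance waits longer than t_c steps.

    Assumptions (illustrative only, not taken from the paper):
      - time is the index of the arriving instance;
      - classify(x) returns an existing-class label;
      - is_outlier(x) flags instances that might belong to a novel class;
      - q buffered outliers are treated as enough evidence for "novel".
    Returns a list of (arrival_time, decision_time, label) tuples.
    """
    buffer = deque()  # deferred (arrival_time, x) pairs, oldest first
    decisions = []

    for now, x in enumerate(xs):
        # Deadline check: anything that has waited t_c steps must be decided now.
        while buffer and now - buffer[0][0] >= t_c:
            arrival, old_x = buffer.popleft()
            decisions.append((arrival, now, classify(old_x)))

        if not is_outlier(x):
            # Ordinary instance: classify immediately.
            decisions.append((now, now, classify(x)))
            continue

        # Possible novel-class instance: defer and wait for similar ones.
        buffer.append((now, x))
        if len(buffer) >= q:
            # Enough deferred outliers accumulated: declare them all novel.
            while buffer:
                arrival, _ = buffer.popleft()
                decisions.append((arrival, now, "novel"))

    # End of stream: decide whatever is still deferred.
    end = len(xs)
    while buffer:
        arrival, old_x = buffer.popleft()
        decisions.append((arrival, end, classify(old_x)))
    return decisions
```

In this sketch, a larger `t_c` gives deferred instances more time to accumulate evidence for a novel class, at the cost of later decisions; a real detector would also test whether the buffered outliers are actually similar to one another, which is omitted here.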
