Mining concept-drifting data streams using ensemble classifiers

Recently, mining data streams with concept drifts for actionable insights has become an important and challenging task for a wide range of applications, including credit card fraud protection, target marketing, and network intrusion detection. Conventional knowledge discovery tools face two challenges: the overwhelming volume of the streaming data and the concept drifts. In this paper, we propose a general framework for mining concept-drifting data streams using weighted ensemble classifiers. We train an ensemble of classification models, such as C4.5, RIPPER, and naive Bayesian, from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected classification accuracy on the test data under the time-evolving environment. Thus, the ensemble approach improves both the efficiency in learning the model and the accuracy in performing classification. Our empirical study shows that the proposed methods have a substantial advantage over single-classifier approaches in prediction accuracy, and that the ensemble framework is effective for a variety of classification models.
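The chunk-based weighting idea can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the `NearestMean` base learner is a toy stand-in for models such as C4.5 or RIPPER, the names `WeightedChunkEnsemble` and `max_models` are hypothetical, and accuracy on the most recent chunk is assumed here as a proxy for expected accuracy on upcoming test data. The newest model is also scored on its own training chunk, which is an optimistic simplification.

```python
from collections import defaultdict, deque


class NearestMean:
    """Toy 1-D base learner (stand-in for C4.5, RIPPER, etc.).

    Predicts the class whose mean feature value is closest to x.
    """

    def fit(self, xs, ys):
        sums, counts = defaultdict(float), defaultdict(int)
        for x, y in zip(xs, ys):
            sums[y] += x
            counts[y] += 1
        self.means = {label: sums[label] / counts[label] for label in sums}
        return self

    def predict(self, x):
        return min(self.means, key=lambda label: abs(x - self.means[label]))


def accuracy(model, xs, ys):
    """Fraction of examples in (xs, ys) that the model labels correctly."""
    return sum(model.predict(x) == y for x, y in zip(xs, ys)) / len(ys)


class WeightedChunkEnsemble:
    """Keeps one model per recent chunk and weights it by recent accuracy."""

    def __init__(self, max_models=3):
        # Assumption: a fixed-size window; the oldest model is dropped first.
        self.models = deque(maxlen=max_models)
        self.weights = []

    def update(self, xs, ys):
        """Train a model on the newest chunk, then reweight all models on it."""
        self.models.append(NearestMean().fit(xs, ys))
        self.weights = [accuracy(m, xs, ys) for m in self.models]

    def predict(self, x):
        """Return the label with the largest total weight across the ensemble."""
        votes = defaultdict(float)
        for model, weight in zip(self.models, self.weights):
            votes[model.predict(x)] += weight
        return max(votes, key=votes.get)


# Two chunks with the same inputs; the labels flip to simulate a concept drift.
xs = [1, 2, 3, 7, 8, 9]
ensemble = WeightedChunkEnsemble()

ensemble.update(xs, [0, 0, 0, 1, 1, 1])
before_drift = ensemble.predict(1)

ensemble.update(xs, [1, 1, 1, 0, 0, 0])
after_drift = ensemble.predict(1)
weights_after_drift = list(ensemble.weights)
```

In this toy run the model trained before the drift scores zero accuracy on the newest chunk, so its vote carries no weight and the ensemble follows the new concept without discarding the old model outright.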
