Mining Data Streams

Knowledge discovery from infinite data streams is an important and difficult task. We are facing two challenges, the overwhelming volume and the concept drifts of the streaming data. In this chapter, we introduce a general framework for mining concept-drifting data streams using weighted ensemble classifiers. We train an ensemble of classification models, such as C4.5, RIPPER, naive Bayesian, etc., from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected classification accuracy on the test data under the time-evolving environment. Thus, the ensemble approach improves both the efficiency in learning the model and the accuracy in performing classification. Our empirical study shows that the proposed methods have substantial advantage over single-classifier approaches in prediction accuracy, and the ensemble framework is effective for a variety of classification models.

[1]  Samuel Madden,et al.  Continuously adaptive continuous queries over streams , 2002, SIGMOD '02.

[2]  Sanjeev Khanna,et al.  Space-efficient online computation of quantile summaries , 2001, SIGMOD '01.

[3]  Philip S. Yu,et al.  Mining concept-drifting data streams using ensemble classifiers , 2003, KDD '03.

[4]  Salvatore J. Stolfo,et al.  Credit Card Fraud Detection Using Meta-Learning: Issues and Initial Results 1 , 1997 .

[5]  Dennis Shasha,et al.  StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time , 2002, VLDB.

[6]  S. Muthukrishnan,et al.  Data streams: algorithms and applications , 2005, SODA '03.

[7]  Philip S. Yu,et al.  A Framework for Scalable Cost-sensitive Learning Based on Combing Probabilities and Benefits , 2002, SDM.

[8]  Divyakant Agrawal,et al.  Efficient Computation of Frequent and Top-k Elements in Data Streams , 2005, ICDT.

[9]  Divesh Srivastava,et al.  On computing correlated aggregates over continual data streams , 2001, SIGMOD '01.

[10]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[11]  Geoff Hulten,et al.  Mining high-speed data streams , 2000, KDD '00.

[12]  Jeffrey F. Naughton,et al.  Rate-based query optimization for streaming information sources , 2002, SIGMOD '02.

[13]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[14]  Geoff Hulten,et al.  Mining time-changing data streams , 2001, KDD '01.

[15]  Jennifer Widom,et al.  Models and issues in data stream systems , 2002, PODS.

[16]  Kagan Tumer,et al.  Error Correlation and Error Reduction in Ensemble Classifiers , 1996, Connect. Sci..

[17]  Philip S. Yu,et al.  A Framework for Projected Clustering of High Dimensional Data Streams , 2004, VLDB.

[18]  Sudipto Guha,et al.  Clustering Data Streams , 2000, FOCS.

[19]  Philip S. Yu,et al.  A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions , 2007, SDM.

[20]  Johannes Gehrke,et al.  BOAT—optimistic decision tree construction , 1999, SIGMOD '99.

[21]  Eric Bauer,et al.  An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants , 1999, Machine Learning.

[22]  Rajeev Rastogi,et al.  Processing complex aggregate queries over data streams , 2002, SIGMOD '02.

[23]  Philip S. Yu,et al.  Progressive modeling , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[24]  William Nick Street,et al.  A streaming ensemble algorithm (SEA) for large-scale classification , 2001, KDD '01.

[25]  Philip S. Yu,et al.  Pruning and dynamic scheduling of cost-sensitive ensembles , 2002, AAAI/IAAI.

[26]  Russ Bubley,et al.  Randomized algorithms , 1995, CSUR.

[27]  Jennifer Widom,et al.  Continuous queries over data streams , 2001, SGMD.

[28]  Yixin Chen,et al.  Multi-Dimensional Regression Analysis of Time-Series Data Streams , 2002, VLDB.

[29]  William W. Cohen Fast Effective Rule Induction , 1995, ICML.

[30]  Paul E. Utgoff,et al.  Incremental Induction of Decision Trees , 1989, Machine Learning.

[31]  Like Gao,et al.  Continually evaluating similarity-based pattern queries on a streaming time series , 2002, SIGMOD '02.

[32]  Rakesh Agrawal,et al.  SPRINT: A Scalable Parallel Classifier for Data Mining , 1996, VLDB.