A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions

Predictive modeling in data streams has attracted a number of interesting studies in recent years. However, most of these studies assume relatively balanced and stable data streams and cannot cope well with skewed (e.g., few positives but many negatives) and stochastic distributions, which are typical of many data stream applications. In this paper, we propose a new approach to mining data streams that estimates reliable posterior probabilities using an ensemble of models to match the distribution over under-samples of negatives and repeated samples of positives. We formally show several important properties of the proposed framework, e.g., the reliability of estimated probabilities on the skewed positive class, the accuracy of estimated probabilities, efficiency, and scalability. Experiments on several synthetic and real-world datasets with skewed distributions demonstrate that our framework has substantial advantages over existing approaches in estimation reliability and prediction accuracy.
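The sampling idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the one-dimensional Gaussian naive Bayes base learner, the parameter `k`, and the synthetic data are all assumptions chosen for brevity. Each ensemble member sees every positive example together with one disjoint under-sample of the negatives, and the ensemble's posterior estimate is the average of the members' estimates.

```python
import math
import random
from statistics import mean, pstdev


def make_training_sets(positives, negatives, k, seed=0):
    """Reuse all positives in each set; split negatives into k disjoint under-samples."""
    rng = random.Random(seed)
    shuffled = negatives[:]
    rng.shuffle(shuffled)
    return [(positives, shuffled[i::k]) for i in range(k)]


class GaussianNB1D:
    """Tiny 1-D Gaussian naive Bayes, a stand-in base learner (assumption)."""

    def fit(self, pos, neg):
        n = len(pos) + len(neg)
        # Store (mean, std, prior) for the positive class, then the negative class.
        self.params = [
            (mean(xs), max(pstdev(xs), 1e-6), len(xs) / n) for xs in (pos, neg)
        ]
        return self

    def predict_proba(self, x):
        """Estimated P(positive | x) under the resampled training distribution."""
        def weighted_density(mu, sigma, prior):
            z = (x - mu) / sigma
            return prior * math.exp(-0.5 * z * z) / sigma

        p_pos = weighted_density(*self.params[0])
        p_neg = weighted_density(*self.params[1])
        total = p_pos + p_neg
        return p_pos / total if total > 0 else 0.5


def ensemble_posterior(models, x):
    """Average the posterior estimates of all ensemble members."""
    return mean(m.predict_proba(x) for m in models)


# Synthetic skewed chunk (assumption): 20 positives near 3, 2000 negatives near 0.
rng = random.Random(42)
positives = [rng.gauss(3.0, 1.0) for _ in range(20)]
negatives = [rng.gauss(0.0, 1.0) for _ in range(2000)]

training_sets = make_training_sets(positives, negatives, k=10)
models = [GaussianNB1D().fit(pos, neg) for pos, neg in training_sets]

for x in (-1.0, 2.0, 4.0):
    print(x, round(ensemble_posterior(models, x), 3))
```

Note that the averaged probabilities are relative to the resampled class ratio (here roughly 1:10 per member rather than 1:100), so they would need to be interpreted or recalibrated accordingly before being compared against the original stream's class prior.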
