Towards incremental learning of nonstationary imbalanced data stream: a multiple selectively recursive approach

Learning from nonstationary data streams poses two main difficulties. First, a dynamically structured learning framework is required to keep up with the evolution of unstable class concepts, i.e., concept drift. Second, an imbalanced class distribution over the data stream demands a mechanism to reinforce the underrepresented class concepts for improved overall performance. To address these challenges, we propose the recursive ensemble approach (REA) in this paper. To counter the imbalanced learning problem in the training data chunk received at any timestamp t, i.e., $\mathcal{S}_t$, REA adaptively pushes into $\mathcal{S}_t$ part of the minority class examples received within [0, t − 1] to balance its skewed class distribution. Hypotheses are then progressively developed over time on the balanced training data chunks and combined into an ensemble classifier in a dynamically weighted manner, which addresses concept drift as it occurs. Theoretical analysis proves that REA can provide less erroneous predictions than a comparative algorithm. In addition, an empirical study on both synthetic benchmarks and a real-world data set validates the effectiveness of REA against other algorithms in terms of overall prediction accuracy and ROC curves.
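The chunk-by-chunk procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how past minority examples are selected, which base learner is used, or how ensemble weights are computed. The sketch therefore assumes (hypothetically) that banked minority examples closest to the current minority centroid are chosen, that a nearest-centroid classifier stands in for the base hypothesis, and that each hypothesis is weighted by log(1/error) on the newest chunk. All function and class names are illustrative.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of feature tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def balance_chunk(chunk, bank, minority=1, ratio=0.5):
    """Add banked minority examples until the minority share approaches `ratio`.

    Assumption: banked examples nearest the current minority centroid are preferred.
    """
    cur_min = [x for x, y in chunk if y == minority]
    n_maj = len(chunk) - len(cur_min)
    needed = max(0, math.ceil(ratio * n_maj / (1 - ratio)) - len(cur_min))
    if cur_min:
        ref = centroid(cur_min)
        bank = sorted(bank, key=lambda x: sq_dist(x, ref))
    return chunk + [(x, minority) for x in bank[:needed]]

def train(chunk):
    """Nearest-centroid model (a stand-in base learner): label -> class centroid."""
    by_label = {}
    for x, y in chunk:
        by_label.setdefault(y, []).append(x)
    return {y: centroid(xs) for y, xs in by_label.items()}

def predict_one(model, x):
    """Predict the label whose centroid is closest to x."""
    return min(model, key=lambda y: sq_dist(x, model[y]))

def error_rate(model, chunk):
    """Fraction of examples in `chunk` that `model` misclassifies."""
    return sum(predict_one(model, x) != y for x, y in chunk) / len(chunk)

class RecursiveEnsemble:
    """Hypothetical sketch: one hypothesis per balanced chunk, re-weighted each step."""

    def __init__(self, minority=1, ratio=0.5):
        self.minority, self.ratio = minority, ratio
        self.bank = []      # minority examples received in earlier chunks
        self.models = []    # one hypothesis per chunk
        self.weights = []   # one weight per hypothesis

    def update(self, chunk):
        balanced = balance_chunk(chunk, self.bank, self.minority, self.ratio)
        self.models.append(train(balanced))
        # Assumed weighting: log(1/error) on the newest chunk, so hypotheses
        # that fit the current concept count more in the vote.
        eps = 1e-6
        self.weights = [math.log(1.0 / max(error_rate(m, chunk), eps))
                        for m in self.models]
        self.bank.extend(x for x, y in chunk if y == self.minority)

    def predict(self, x):
        """Weighted vote over all hypotheses."""
        votes = {}
        for m, w in zip(self.models, self.weights):
            label = predict_one(m, x)
            votes[label] = votes.get(label, 0.0) + w
        return max(votes, key=votes.get)

# Toy stream: two chunks, each with four majority (0) and one minority (1) example.
stream = [
    [((0.0, 0.0), 0), ((1.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 1.0), 0), ((5.0, 5.0), 1)],
    [((0.5, 0.5), 0), ((0.0, 0.0), 0), ((1.0, 1.0), 0), ((1.0, 0.0), 0), ((5.0, 6.0), 1)],
]
ensemble = RecursiveEnsemble()
for chunk in stream:
    ensemble.update(chunk)
print(ensemble.predict((4.0, 4.0)), ensemble.predict((0.2, 0.1)))
```

Re-weighting every hypothesis on the newest chunk is one simple way to let the ensemble follow drift: hypotheses trained on outdated concepts score a higher error on current data and so carry less weight in the vote.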
