A Novel Ensemble Classification for Data Streams with Class Imbalance and Concept Drift

Processing streaming data imposes new requirements: restricted processing time, a limited amount of memory, and a single scan of incoming instances. One of the biggest challenges in data stream learning is concept drift, i.e., the underlying distribution of the data may evolve over time. Most approaches in the literature assume that the class distribution is balanced. Unfortunately, class imbalance is common in the real world, and it makes the concept drift problem even harder to solve. Motivated by this challenge, a novel ensemble classification method for mining imbalanced streaming data is proposed to address both issues simultaneously. The algorithm uses under-sampling and over-sampling techniques to balance the positive and negative instances. In addition, a dynamic weighting strategy is adopted to deal with concept drift. Experimental results on synthetic and real datasets demonstrate that the proposed method performs better than competing algorithms, especially when concept drift and class imbalance occur together.
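The abstract gives no algorithmic detail, so the following is only a minimal sketch of the two ideas it names, not the authors' method. It assumes the stream arrives in chunks of (features, label) pairs, that resampling is plain random under-sampling plus random over-sampling by duplication, and that each ensemble member's weight is blended with its accuracy on the newest chunk. The names `balance_chunk`, `update_weights`, `weighted_vote` and the `decay` parameter are hypothetical.

```python
import random


def balance_chunk(chunk, rng, positive_label=1):
    """Return a balanced copy of `chunk`, a list of (features, label) pairs.

    Assumption: the majority class is randomly under-sampled and the minority
    class is randomly over-sampled (by duplication) so both reach the same size.
    """
    pos = [ex for ex in chunk if ex[1] == positive_label]
    neg = [ex for ex in chunk if ex[1] != positive_label]
    if not pos or not neg:
        return list(chunk)  # only one class present, nothing to balance
    target = (len(pos) + len(neg)) // 2
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, target)  # under-sample the majority class
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    return kept + minority + extra  # minority plus duplicated minority examples


def update_weights(weights, members, chunk, decay=0.5):
    """Blend each member's old weight with its accuracy on the newest chunk.

    Assumption: `members` are callables mapping features to a predicted label.
    Members that fit the current concept poorly lose influence over time.
    """
    updated = []
    for weight, predict in zip(weights, members):
        accuracy = sum(predict(x) == y for x, y in chunk) / len(chunk)
        updated.append(decay * weight + (1 - decay) * accuracy)
    return updated


def weighted_vote(weights, members, x):
    """Predict the label with the largest total weight across members."""
    votes = {}
    for weight, predict in zip(weights, members):
        label = predict(x)
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)
```

In a chunk-based loop, one would typically train a new member on `balance_chunk(chunk, rng)`, call `update_weights` on the raw incoming chunk, and classify with `weighted_vote`. Weighting by plain accuracy is itself a simplification here, since accuracy favors the majority class on imbalanced chunks.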
