The impact of data difficulty factors on classification of imbalanced and concept drifting data streams

Class imbalance introduces additional challenges when learning classifiers from concept drifting data streams. Most existing work focuses on designing new algorithms that deal with the global imbalance ratio and does not consider other data complexities. Independent research on static imbalanced data has highlighted the influential role of local data difficulty factors, such as minority class decomposition and the presence of unsafe types of examples. Although such factors are often present in real-world data, their interactions with concept drifts have not yet been investigated in concept drifting data streams. We thoroughly study the impact of such interactions on drifting imbalanced streams. For this purpose, we put forward a new categorization of concept drifts for class imbalanced problems. Through comprehensive experiments with synthetic and real data streams, we study the influence of concept drifts, global class imbalance, local data difficulty factors, and their combinations on the predictions of representative online classifiers. Experimental results reveal the strong influence of the newly considered factors and their local drifts, as well as differences in how existing classifiers react to such factors. Combinations of multiple factors are the most challenging for classifiers. Although existing classifiers are partially capable of coping with global class imbalance, new approaches are needed to address the challenges posed by imbalanced data streams.
