Data Preprocessing and Dynamic Ensemble Selection for Imbalanced Data Stream Classification

Learning from non-stationary imbalanced data streams is a serious challenge for the machine learning community. A significant number of works address the classification of non-stationary data streams, but most of them do not take into consideration that real-life data streams may exhibit a high and changing class imbalance ratio, which can complicate the classification task. This work attempts to connect two important, yet rarely combined, research trends in data analysis, i.e., non-stationary data stream classification and imbalanced data classification. We propose a novel framework for training base classifiers and preparing the dynamic selection dataset (DSEL) to integrate data preprocessing and dynamic ensemble selection (DES) methods for imbalanced data stream classification. The proposed approach has been evaluated through computer experiments on 72 artificially generated data streams with various imbalance ratios, levels of label noise, and types of concept drift. In addition, we consider six variations of preprocessing methods and four DES methods. Experimental results show that dynamic ensemble selection, even without the use of any data preprocessing, can outperform a naive combination of the whole pool generated with the use of preprocessing methods. Combining DES with preprocessing further improves the obtained results.
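The pipeline the abstract describes can be sketched on a single stream chunk: split the chunk into a pool-training part and a DSEL, train base classifiers on preprocessed (oversampled) bootstrap samples, then select per-instance only those classifiers that are competent in the instance's neighbourhood of the DSEL. The sketch below is a minimal illustration, not the authors' implementation: it uses naive random oversampling as a stand-in for the SMOTE-family preprocessing, a KNORA-Eliminate-style selection rule as one of many possible DES methods, and hypothetical parameter choices (pool size 10, k = 7).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier

# One imbalanced "chunk" of a synthetic stream (assumed 90/10 class ratio).
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
# Split the training part into pool-training data and the DSEL.
X_pool, X_dsel, y_pool, y_dsel = train_test_split(
    X_train, y_train, stratify=y_train, random_state=0)

rng = np.random.default_rng(0)

def oversample(X, y):
    """Naive random oversampling (placeholder for SMOTE-style preprocessing)."""
    classes, counts = np.unique(y, return_counts=True)
    Xs, ys = [X], [y]
    for c, n in zip(classes, counts):
        if n < counts.max():
            idx = rng.choice(np.where(y == c)[0], size=counts.max() - n, replace=True)
            Xs.append(X[idx]); ys.append(y[idx])
    return np.vstack(Xs), np.concatenate(ys)

# Train a pool of base classifiers on preprocessed bootstrap samples.
pool = []
for seed in range(10):
    idx = rng.choice(len(X_pool), size=len(X_pool), replace=True)
    Xb, yb = oversample(X_pool[idx], y_pool[idx])
    pool.append(DecisionTreeClassifier(max_depth=3, random_state=seed).fit(Xb, yb))

# KNORA-Eliminate-style DES: keep only classifiers that are correct on the
# whole k-neighbourhood of the query point in the DSEL, shrinking k if needed.
nn = NearestNeighbors(n_neighbors=7).fit(X_dsel)

def des_predict(x):
    neigh = nn.kneighbors(x.reshape(1, -1), return_distance=False)[0]
    for k in range(len(neigh), 0, -1):
        sel = [c for c in pool
               if np.all(c.predict(X_dsel[neigh[:k]]) == y_dsel[neigh[:k]])]
        if sel:
            break
    else:
        sel = pool  # no classifier is locally perfect: fall back to the pool
    votes = [c.predict(x.reshape(1, -1))[0] for c in sel]
    return np.bincount(votes).argmax()

y_pred = np.array([des_predict(x) for x in X_test])
print("accuracy:", (y_pred == y_test).mean())
```

In a streaming setting this train/select cycle would be repeated per chunk, with the pool updated to track concept drift; here a single static chunk keeps the selection mechanism visible.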
