A study on combining dynamic selection and data preprocessing for imbalance learning

Abstract In real-life applications, classifier learning may encounter a dataset in which the number of instances of one class is much higher than that of the others. Such imbalanced datasets require special attention because traditional classifiers tend to favor the majority class, which contains the larger number of instances. Ensemble classifiers have been reported to yield promising results in such cases. Most often, these ensembles are combined with data-level preprocessing techniques that balance class proportions through under-sampling and/or over-sampling. Most available studies concentrate on static ensembles built on different preprocessing techniques. In contrast to static ensembles, dynamic ensembles have become popular thanks to their performance on ill-defined problems (small datasets). A dynamic ensemble includes a dynamic selection module that chooses the most competent classifier or subset of classifiers for each test instance. This paper experimentally evaluates the claim that dynamic selection combined with a preprocessing technique can achieve higher performance than a static ensemble on imbalanced classification problems. For this evaluation, we collect 84 two-class and 26 multi-class datasets with varying degrees of class imbalance. In addition, we consider five variants of preprocessing methods and four dynamic selection methods, and we design an experimental framework that integrates preprocessing with dynamic selection. Our experiments show that dynamic ensembles improve the F-measure and the G-mean compared to static ensembles. Moreover, across different levels of imbalance, dynamic selection methods secure higher ranks than the other alternatives.
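The pipeline the abstract describes (balance the training data, build a classifier pool, then select per test instance) can be sketched in miniature. The code below is a hypothetical illustration, not the paper's exact framework: it uses simple random over-sampling in place of the five preprocessing variants, a bagged pool of 1-NN classifiers, and OLA-style (Overall Local Accuracy) dynamic selection, one of the classical dynamic selection schemes. All names and the toy dataset are invented for illustration.

```python
# Hypothetical sketch: random over-sampling + bagged 1-NN pool + OLA-style
# dynamic selection. Pure stdlib; not the authors' implementation.
import random
from collections import Counter

random.seed(0)

def dist(a, b):
    # Squared Euclidean distance (monotone in true distance, so fine for NN).
    return sum((x - y) ** 2 for x, y in zip(a, b))

class OneNN:
    """Minimal 1-nearest-neighbor classifier."""
    def __init__(self, X, y):
        self.X, self.y = X, y
    def predict(self, x):
        return self.y[min(range(len(self.X)), key=lambda i: dist(self.X[i], x))]

def oversample(X, y):
    """Duplicate minority instances until all class counts match the majority."""
    counts = Counter(y)
    target = max(counts.values())
    Xb, yb = list(X), list(y)
    for cls, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == cls]
        for _ in range(target - n):
            i = random.choice(idx)
            Xb.append(X[i])
            yb.append(y[i])
    return Xb, yb

def build_pool(X, y, n_classifiers=5):
    """Bagging: each pool member is a 1-NN trained on a bootstrap sample."""
    pool = []
    for _ in range(n_classifiers):
        idx = [random.randrange(len(X)) for _ in range(len(X))]
        pool.append(OneNN([X[i] for i in idx], [y[i] for i in idx]))
    return pool

def ola_predict(pool, Xval, yval, x, k=3):
    """Pick the pool member most accurate on the k validation points nearest
    to the query (the query's region of competence), then use its prediction."""
    nbrs = sorted(range(len(Xval)), key=lambda i: dist(Xval[i], x))[:k]
    best = max(pool, key=lambda c: sum(c.predict(Xval[i]) == yval[i] for i in nbrs))
    return best.predict(x)

# Toy imbalanced data: class 0 (majority) near the origin, class 1 near (5, 5).
Xtr = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2), (5, 5), (6, 5)]
ytr = [0, 0, 0, 0, 0, 0, 1, 1]

Xb, yb = oversample(Xtr, ytr)       # class counts are now 6 vs 6
pool = build_pool(Xb, yb)
print(ola_predict(pool, Xtr, ytr, (5.5, 5.2)))  # query in the minority region -> 1
```

The design point the sketch makes concrete is the one the paper evaluates: preprocessing fixes the class proportions once, globally, while dynamic selection adapts the decision to each query's neighborhood, which is where the two mechanisms can complement each other.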
