Ordering-based pruning for improving the performance of ensembles of classifiers in the framework of imbalanced datasets

Highlights

- The use of ordering-based pruning approaches for ensemble learning in imbalanced classification is proposed.
- Standard pruning schemes are adapted to the framework of imbalanced data.
- The BB-Imb and RE-GM metrics allow a significant gain in the studied models, enabling baseline methodologies to be outperformed.
- The Boosting-Based Imbalanced (BB-Imb) approach in conjunction with UnderBagging excels as the best option.
- Conclusions are supported by a thorough experimental study with 66 datasets.

Abstract

Classification with imbalanced datasets has gained notable significance in recent years, since a large number of problems with highly skewed class distributions can be found, and this skew degrades the global performance of the system. A great number of approaches have been developed to address this problem, traditionally from three perspectives: data-level treatment, algorithm adaptation, and cost-sensitive learning. Ensemble-based models are an extension of these solutions: they consider a pool of classifiers and can in turn integrate any of the former proposals. The quality and performance of this type of methodology over baseline solutions have been shown in several studies in the specialized literature. The goal of this work is to improve the capabilities of tree-based ensemble solutions specifically designed for imbalanced classification, focusing on the best-behaving bagging- and boosting-based ensembles in this scenario. To do so, this paper proposes several new metrics for ordering-based pruning that are properly adapted to the skewed class distribution. The experimental study yields two main results: on the one hand, the new metrics make pruning a very successful approach in this scenario; on the other hand, the UnderBagging model excels, achieving the highest gain from pruning, since the randomly undersampled training sets that best complement each other can be selected. Accordingly, this scheme is capable of outperforming previous ensemble models from the state of the art.
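The abstract does not give the formal definitions of the BB-Imb and RE-GM metrics, so the sketch below should be read as an illustration of the general scheme rather than the authors' exact procedure. It builds an UnderBagging ensemble of decision trees from balanced random undersamples and then applies greedy ordered aggregation, selecting the sub-ensemble that maximizes an imbalance-aware criterion. Here that criterion is the standard geometric mean of the class-wise recalls, GM = sqrt(TPR x TNR), in the spirit of RE-GM; the function names, the 0/1 label convention (minority class labelled 1), and the parameter defaults are all illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def geometric_mean(y_true, y_pred):
    """GM = sqrt(TPR * TNR); accounts for both classes, unlike accuracy.

    Assumes the minority class is labelled 1 and the majority class 0.
    """
    tpr = np.mean(y_pred[y_true == 1] == 1)  # recall on the minority class
    tnr = np.mean(y_pred[y_true == 0] == 0)  # recall on the majority class
    return np.sqrt(tpr * tnr)


def train_underbagging(X, y, n_estimators=40, seed=0):
    """UnderBagging: each tree is fit on a balanced random undersample."""
    rng = np.random.RandomState(seed)
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    ensemble = []
    for _ in range(n_estimators):
        # Draw as many majority examples as there are minority examples.
        neg_sample = rng.choice(neg_idx, size=len(pos_idx), replace=False)
        idx = np.concatenate([pos_idx, neg_sample])
        tree = DecisionTreeClassifier(random_state=rng.randint(1 << 30))
        ensemble.append(tree.fit(X[idx], y[idx]))
    return ensemble


def ordered_pruning(ensemble, X_val, y_val, target_size=21):
    """Greedy ordered aggregation guided by an imbalance-aware metric.

    At each step, the classifier whose inclusion maximizes the GM of the
    sub-ensemble's majority vote on the validation data is appended next;
    only the first `target_size` classifiers in the ordering are kept.
    """
    preds = [clf.predict(X_val) for clf in ensemble]  # cache predictions
    selected, remaining = [], list(range(len(ensemble)))
    votes = np.zeros(len(y_val))  # running count of positive votes
    while remaining and len(selected) < target_size:
        k = len(selected) + 1
        scores = {i: geometric_mean(
                      y_val, ((votes + preds[i]) / k >= 0.5).astype(int))
                  for i in remaining}
        best = max(scores, key=scores.get)
        votes += preds[best]
        selected.append(best)
        remaining.remove(best)
    return [ensemble[i] for i in selected]
```

A typical call would be `pruned = ordered_pruning(train_underbagging(X_tr, y_tr), X_tr, y_tr)`; whether the ordering is guided by the training data or by a held-out set is a design choice the sketch leaves to the caller.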
