Boosting methods for multi-class imbalanced data classification: an experimental review

Canonical machine learning algorithms assume that a dataset has an equal number of samples in each class, so discriminating minority class samples efficiently in imbalanced datasets is very challenging even in binary classification. For this reason, researchers have paid considerable attention to the problem and have proposed many methods to deal with it, which can be broadly categorized into data-level and algorithm-level approaches. Multi-class imbalanced learning is much harder than the binary case and remains an open problem. Boosting algorithms are a class of ensemble learning methods that improve the performance of separate base learners by combining them into a composite whole. This paper reviews the most significant published boosting techniques on multi-class imbalanced datasets. A thorough empirical comparison is conducted to analyze the performance of binary and multi-class boosting algorithms on various multi-class imbalanced datasets. In addition, based on the results obtained for the performance evaluation metrics and on recently proposed criteria for comparing metrics, the selected metrics are compared to determine a suitable performance metric for multi-class imbalanced datasets. The experimental studies show that the CatBoost and LogitBoost algorithms are superior to the other boosting algorithms on multi-class imbalanced conventional and big datasets, respectively. Furthermore, the MMCC is a better evaluation metric than the MAUC and G-mean in multi-class imbalanced data domains.
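
To make the metric comparison more concrete, the following is a minimal sketch of how two of the named metrics could be computed from a confusion matrix. The abstract does not define them, so the sketch assumes that MMCC refers to the standard multi-class generalization of the Matthews correlation coefficient and that G-mean is the geometric mean of the per-class recalls. The function names and the toy labels are hypothetical and are not taken from the paper. MAUC is omitted because it requires class scores rather than hard predictions.

```python
import math


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def multiclass_mcc(matrix):
    """Multi-class MCC (assumed here to be what the abstract calls MMCC)."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(k))
    true_counts = [sum(row) for row in matrix]
    pred_counts = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    numerator = correct * total - sum(
        t * p for t, p in zip(true_counts, pred_counts)
    )
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    # Convention: return 0.0 when the denominator vanishes, e.g. when
    # every sample is assigned to a single class.
    return numerator / denominator if denominator else 0.0


def g_mean(matrix):
    """Geometric mean of per-class recall (assumed G-mean definition)."""
    # Classes with no true samples are skipped to avoid division by zero.
    recalls = [row[i] / sum(row) for i, row in enumerate(matrix) if sum(row) > 0]
    return math.prod(recalls) ** (1 / len(recalls))


def accuracy(matrix):
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


# Hypothetical imbalanced 3-class problem: 90 / 8 / 2 samples.
labels = [0, 1, 2]
y_true = [0] * 90 + [1] * 8 + [2] * 2
# A degenerate classifier that always predicts the majority class.
y_majority = [0] * 100

cm = confusion_matrix(y_true, y_majority, labels)
print("accuracy:", accuracy(cm))
print("multi-class MCC:", multiclass_mcc(cm))
print("G-mean:", g_mean(cm))
```

On this toy example the majority-only classifier reaches an accuracy of 0.9 while both the multi-class MCC and the G-mean are 0, which illustrates why plain accuracy is a poor guide on imbalanced data. The sketch does not reproduce the paper's comparison between MMCC, MAUC, and G-mean.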
