Entropy and Confidence-Based Undersampling Boosting Random Forests for Imbalanced Problems

In this article, we propose a novel entropy and confidence-based undersampling boosting (ECUBoost) framework for imbalanced classification problems. A boosting-based ensemble is combined with a new undersampling method to improve generalization performance. To avoid discarding informative samples during the data preprocessing of the boosting-based ensemble, ECUBoost uses both confidence and entropy as benchmarks, ensuring the validity and structural distribution of the retained majority samples during undersampling. Moreover, unlike other iterative dynamic resampling methods, the confidence-based component of ECUBoost can be applied to classifiers trained without iterations, such as decision trees. Random forests serve as the base classifiers in ECUBoost. Experimental results on both artificial data sets and KEEL data sets demonstrate the effectiveness of the proposed method.
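To make the idea above concrete, here is a minimal sketch in Python of an entropy and confidence-based undersampling loop inside a boosting-style ensemble, built on scikit-learn's RandomForestClassifier. This is not the authors' implementation: the weighting factor `alpha`, the number of rounds, the additive score combination, and the ranking direction are all illustrative assumptions, and the paper's exact benchmarks may differ.

```python
# Hypothetical sketch of the ECUBoost idea: in each round, rank majority
# samples by confidence (validity) and prediction entropy (structural
# distribution), keep a balanced subset, and train the next random forest.
# Assumes integer class labels; alpha and rounds are illustrative choices.
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def entropy_confidence_undersample(X_maj, n_keep, maj_label, forest, alpha=0.5):
    """Select n_keep majority samples ranked by a combined
    confidence/entropy score from the current forest."""
    proba = forest.predict_proba(X_maj)                 # class-membership probabilities
    maj_idx = list(forest.classes_).index(maj_label)
    confidence = proba[:, maj_idx]                      # how reliably each sample is classified
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)  # boundary/overlap information
    score = alpha * confidence + (1.0 - alpha) * entropy       # assumed combination
    keep = np.argsort(score)[::-1][:n_keep]             # assumed: higher score is kept
    return X_maj[keep]


def ecuboost_fit(X, y, maj_label, min_label, rounds=10, seed=0):
    """Train a list of random forests, re-undersampling the majority
    class between rounds with the newest forest's outputs."""
    rng = np.random.default_rng(seed)
    X_maj, X_min = X[y == maj_label], X[y == min_label]
    # First round: plain random undersampling to bootstrap the process.
    sel = rng.choice(len(X_maj), size=len(X_min), replace=False)
    X_maj_sel = X_maj[sel]
    ensemble = []
    for _ in range(rounds):
        X_t = np.vstack([X_maj_sel, X_min])
        y_t = np.concatenate([np.full(len(X_maj_sel), maj_label),
                              np.full(len(X_min), min_label)])
        rf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_t, y_t)
        ensemble.append(rf)
        # Re-rank the full majority set with the newest forest for the next round.
        X_maj_sel = entropy_confidence_undersample(X_maj, len(X_min), maj_label, rf)
    return ensemble
```

Because the confidence scores come from `predict_proba` on an already-trained forest, this selection step needs no iterative retraining of the base learner itself, which mirrors the abstract's point that the confidence benchmark applies to non-iterative algorithms such as decision trees.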
