The Effects of Class Imbalance and Training Data Size on Classifier Learning: An Empirical Study

This study examines the effects of class imbalance and training data size on the predictive performance of classifiers. An empirical study was performed on ten classifiers from seven categories, all of which are frequently employed and have been identified as efficient. In addition, comprehensive hyperparameter tuning was carried out on every dataset to maximize the performance of each classifier. The results indicated that (1) naive Bayes, logistic regression and the logit leaf model are less susceptible to class imbalance, but have relatively poor predictive performance; (2) the ensemble classifiers AdaBoost, XGBoost and parRF are considerably less stable under class imbalance, but achieved superior predictive accuracy; (3) for all classifiers employed in this study, accuracy decreased once the class imbalance skew reached a threshold of 0.10; note that although datasets with a balanced class distribution would be the ideal condition for maximizing classifier performance, if the skew is larger than 0.10, comprehensive hyperparameter tuning may be able to eliminate the effect of class imbalance; (4) no classifier proved robust to changes in training data size; (5) CART is the least preferable choice among the ten classifiers.
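To make the notion of a controlled class imbalance skew concrete, the following is a minimal sketch of how a dataset could be subsampled to a target skew. It assumes that "skew" means the fraction of minority-class examples in the sample, which the abstract does not define; the function name `subsample_to_skew` and its parameters are hypothetical and are not taken from the study.

```python
import random
from collections import Counter


def subsample_to_skew(labels, minority_label, skew, seed=0):
    """Return sorted indices of a subsample whose minority fraction is about `skew`.

    Assumption: `skew` is the minority-class fraction of the subsample.
    All majority examples are kept; minority examples are randomly dropped.
    """
    if not 0.0 < skew < 1.0:
        raise ValueError("skew must be strictly between 0 and 1")
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Solve n_min / (n_min + n_maj) = skew for n_min.
    target = round(skew * len(majority) / (1.0 - skew))
    if target > len(minority):
        raise ValueError("not enough minority examples for this skew")
    rng = random.Random(seed)  # fixed seed for a reproducible subsample
    keep = rng.sample(minority, target)
    return sorted(keep + majority)


# Toy label vector: 500 minority (1) and 900 majority (0) examples.
labels = [1] * 500 + [0] * 900
idx = subsample_to_skew(labels, minority_label=1, skew=0.10)
counts = Counter(labels[i] for i in idx)
minority_fraction = counts[1] / len(idx)
```

Repeating the call over a grid of skew values, and separately truncating the returned indices to different lengths, would give one possible way to vary class imbalance and training data size independently before fitting and tuning each classifier.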
