Class imbalance revisited: a new experimental setup to assess the performance of treatment methods

In the last decade, class imbalance has attracted a huge amount of attention from researchers and practitioners. Class imbalance is ubiquitous in Machine Learning, Data Mining and Pattern Recognition applications; therefore, these research communities have responded to such interest with dozens of methods and techniques. Surprisingly, there are still many fundamental open questions, such as “Are all learning paradigms equally affected by class imbalance?”, “What is the expected performance loss for different imbalance degrees?” and “How much of the performance loss can be recovered by the treatment methods?”. In this paper, we propose a simple experimental design to assess the performance of class imbalance treatment methods. This experimental setup uses real data sets with artificially modified class distributions to evaluate classifiers over a wide range of class imbalance degrees. We apply this experimental design in a large-scale experimental evaluation with 22 data sets and seven learning algorithms from different paradigms. We also propose a statistical procedure, based on confidence intervals, to evaluate the relative degradations and recoveries. This procedure allows a simple yet insightful visualization of the results and provides the basis for drawing statistical conclusions. Our results indicate that the expected performance loss, as a percentage of the performance obtained with the balanced distribution, is quite modest (below 5 %) for the more balanced distributions, down to 10 % of minority class examples. However, the loss tends to increase quickly for higher degrees of class imbalance, reaching 20 % for 1 % of minority class examples. Support Vector Machine is the classifier paradigm least affected by class imbalance, being almost insensitive to all but the most imbalanced distributions. Finally, we show that the treatment methods only partially recover the performance losses: on average, about 30 % or less of the performance that was lost due to class imbalance was recovered by these methods.
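The two quantities discussed above, the loss relative to the balanced distribution and the share of that loss recovered by a treatment method, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the resampling strategy (keep as many examples as possible and thin one class at random) and the requirement that higher performance values are better are all assumptions made for illustration; the abstract does not specify them.

```python
import random


def resample_to_minority_fraction(labels, minority_label, fraction, seed=0):
    """Return sorted indices of a subsample with roughly `fraction` minority examples.

    Assumed strategy (not taken from the paper): keep every majority example
    when possible and randomly drop minority examples; otherwise keep every
    minority example and randomly drop majority examples.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Minority count needed if every majority example is kept.
    needed_minority = round(fraction * len(majority) / (1.0 - fraction))
    if needed_minority <= len(minority):
        minority = rng.sample(minority, needed_minority)
    else:
        # Too few minority examples: keep them all and thin the majority.
        needed_majority = round(len(minority) * (1.0 - fraction) / fraction)
        majority = rng.sample(majority, needed_majority)
    return sorted(minority + majority)


def relative_loss(perf_balanced, perf_imbalanced):
    """Performance loss as a percentage of the balanced-distribution performance."""
    if perf_balanced == 0:
        raise ValueError("balanced performance must be non-zero")
    return 100.0 * (perf_balanced - perf_imbalanced) / perf_balanced


def relative_recovery(perf_balanced, perf_imbalanced, perf_treated):
    """Percentage of the lost performance that a treatment method wins back."""
    lost = perf_balanced - perf_imbalanced
    if lost == 0:
        raise ValueError("no performance was lost, so recovery is undefined")
    return 100.0 * (perf_treated - perf_imbalanced) / lost
```

In this sketch, a classifier would be evaluated once on the balanced subsample and once on each imbalanced subsample; `relative_loss` then compares the two scores, and `relative_recovery` is computed after retraining with a treatment method applied to the imbalanced training data.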
