On the Class Imbalance Problem

The class imbalance problem has been recognized in many practical domains and has become a hot topic in machine learning in recent years. In such a problem, almost all the examples are labeled as one class, while far fewer examples are labeled as the other class, which is usually the more important one. In this case, standard machine learning algorithms tend to be overwhelmed by the majority class and ignore the minority class, because traditional classifiers seek accurate performance over the full range of instances. This paper first reviews academic activities devoted to the class imbalance problem. It then investigates various remedies at four different levels, organized according to learning phases. After surveying evaluation metrics and some other related factors, the paper closes with some future directions.
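
To make the point about accuracy-seeking classifiers concrete, the following minimal sketch scores a trivial predictor that always outputs the majority class. The 990-to-10 class split, the label encoding (0 for majority, 1 for minority), and the helper names are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch: why overall accuracy hides a neglected minority class.
# The 990/10 split and the 0/1 label encoding are assumptions for this example.

def accuracy(y_true, y_pred):
    """Fraction of all examples that are predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of true `positive` examples that are predicted as `positive`."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

# 990 majority-class examples (label 0) and 10 minority-class examples (label 1).
y_true = [0] * 990 + [1] * 10

# A degenerate "classifier" that always predicts the majority class.
y_pred = [0] * len(y_true)

majority_accuracy = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive=1)

print(f"accuracy = {majority_accuracy:.2f}")
print(f"minority recall = {minority_recall:.2f}")
```

Under these assumptions the predictor looks strong by overall accuracy while never detecting a single minority example, which is why class-sensitive measures such as per-class recall are informative on skewed data.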
