A Comparison Study of Cost-Sensitive Learning and Sampling Methods on Imbalanced Data Sets

The classifier, built from a highly-skewed class distribution data set, generally predicts an unknown sample as the majority class much more frequently than the minority class. This is due to the fact that the aim of classifier is designed to get the highest classification accuracy. We compare three classification methods dealing with the data sets in which class distribution is imbalanced and has non-uniform misclassification cost, namely cost-sensitive learning method whose misclassification cost is embedded in the algorithm, over-sampling method and under-sampling method. In this paper, we compare these three methods to determine which one will produce the best overall classification under any circumstance. We have the following conclusion: 1. Cost-sensitive learning is suitable for the classification of imbalanced dataset. It outperforms sampling methods overall, and is more stable than sampling methods except the condition that data set is quite small. 2. If the dataset is highly skewed or quite small, over-sampling methods may be better.

[1]  Ning Chen,et al.  Weighted Learning Vector Quantization to Cost-Sensitive Learning , 2010, ICANN.

[2]  Charles Elkan,et al.  The Foundations of Cost-Sensitive Learning , 2001, IJCAI.

[3]  Chris. Drummond,et al.  C 4 . 5 , Class Imbalance , and Cost Sensitivity : Why Under-Sampling beats OverSampling , 2003 .

[4]  Peter D. Turney Cost-Sensitive Classification: Empirical Evaluation of a Hybrid Genetic Decision Tree Induction Algorithm , 1994, J. Artif. Intell. Res..

[5]  Yi Lin,et al.  Support Vector Machines for Classification in Nonstandard Situations , 2002, Machine Learning.

[6]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[7]  Robert C. Holte,et al.  C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling , 2003 .

[8]  Gary Weiss,et al.  Does cost-sensitive learning beat sampling for classifying rare classes? , 2005, UBDM '05.

[9]  Chao Chen,et al.  Using Random Forest to Learn Imbalanced Data , 2004 .

[10]  Liu Yu Over-sampling algorithm based on negative immune in imbalanced data sets learning , 2010 .

[11]  Song Zhi-huan Mining Knowledge from Unbalanced Data: Effect of Class Distribution on SVM Classification , 2005 .

[12]  M. Maloof Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown , 2003 .

[13]  John Langford,et al.  An iterative method for multi-class cost-sensitive learning , 2004, KDD.

[14]  Zhi-Hua Zhou,et al.  Learning with cost intervals , 2010, KDD '10.