Cost-Sensitive Learning vs. Sampling: Which is Best for Handling Unbalanced Classes with Unequal Error Costs?

The classifier built from a data set with a highly skewed class distribution generally predicts the more frequently occurring classes much more often than the infrequently occurring classes. This is largely due to the fact that most classifiers are designed to maximize accuracy. In many instances, such as for medical diagnosis, this classification behavior is unacceptable because the minority class is the class of primary interest (i.e., it has a much higher misclassification cost than the majority class). In this paper we compare three methods for dealing with data that has a skewed class distribution and nonuniform misclassification costs. The first method incorporates the misclassification costs into the learning algorithm while the other two methods employ oversampling or undersampling to make the training data more balanced. In this paper we empirically compare the effectiveness of these methods in order to determine which produces the best overall classifier—and under what circumstances.

[1]  M. Maloof Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown , 2003 .

[2]  Foster J. Provost,et al.  Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction , 2003, J. Artif. Intell. Res..

[3]  J. Saver,et al.  A Quantitative Study , 2005 .

[4]  Stan Matwin,et al.  Addressing the Curse of Imbalanced Training Sets: One-Sided Selection , 1997, ICML.

[5]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[6]  Charles Elkan,et al.  The Foundations of Cost-Sensitive Learning , 2001, IJCAI.

[7]  Nitesh V. Chawla,et al.  C4.5 and Imbalanced Data sets: Investigating the eect of sampling method, probabilistic estimate, and decision tree structure , 2003 .

[8]  Salvatore J. Stolfo,et al.  Toward Scalable Learning with Non-Uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection , 1998, KDD.

[9]  Gary Weiss,et al.  Improving classifier utility by altering the misclassification cost ratio , 2005, UBDM '05.

[10]  J. R. Quinlan,et al.  Data Mining Tools See5 and C5.0 , 2004 .

[11]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[12]  Robert C. Holte,et al.  C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling , 2003 .

[13]  Edwin P. D. Pednault,et al.  The Importance of Estimation Errors in Cost-Sensitive Learning , .

[14]  John Langford,et al.  An iterative method for multi-class cost-sensitive learning , 2004, KDD.

[15]  Haym Hirsh,et al.  A Quantitative Study of Small Disjuncts , 2000, AAAI/IAAI.

[16]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[17]  Chris. Drummond,et al.  C 4 . 5 , Class Imbalance , and Cost Sensitivity : Why Under-Sampling beats OverSampling , 2003 .

[18]  Chao Chen,et al.  Using Random Forest to Learn Imbalanced Data , 2004 .

[19]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[20]  J. Ross Quinlan,et al.  Induction of Decision Trees , 1986, Machine Learning.