Learning with Skewed Class Distributions

Several aspects may influence the performance achieved by a classifier created by a Machine Learning system. One of these aspects is the difference between the numbers of examples belonging to each class. When this difference is large, the learning system may have difficulty learning the concept related to the minority class. In this work, we discuss several issues related to learning with skewed class distributions, such as the relationship between cost-sensitive learning and class distributions, and the limitations of accuracy and error rate as measures of classifier performance. We also survey some methods proposed by the Machine Learning community to solve the problem of learning with imbalanced data sets, and discuss some limitations of these methods.
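
As an illustration of why accuracy and error rate can mislead under a skewed class distribution, the following minimal sketch scores a trivial classifier that always predicts the majority class. The data set (990 majority and 10 minority examples) and the helper names `accuracy` and `recall` are assumptions made up for this example; they are not taken from the work itself.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of examples of class `positive` that were predicted as `positive`."""
    preds_for_positive = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positive:
        return 0.0
    return sum(p == positive for p in preds_for_positive) / len(preds_for_positive)


# Hypothetical skewed data set: 990 majority examples (0), 10 minority examples (1).
y_true = [0] * 990 + [1] * 10

# Trivial classifier that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

acc = accuracy(y_true, y_pred)
error_rate = 1.0 - acc
minority_recall = recall(y_true, y_pred, positive=1)

print(f"accuracy        = {acc:.3f}")
print(f"error rate      = {error_rate:.3f}")
print(f"minority recall = {minority_recall:.3f}")
```

Under these assumptions the trivial classifier reaches 99% accuracy while recognizing no minority example at all, which is one way to see why a per-class measure is needed alongside accuracy when the classes are imbalanced.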
