Learning with limited minority class data

A practical problem in data mining and machine learning is the limited availability of data. In binary classification, for example, examples of one class are often abundant while examples of the other are in short supply; the rare class, typically the positive class, may be limited by the financial cost or the time required to collect its examples. This work presents a comprehensive empirical study of learning when examples of one class are extremely rare but examples of the other class(es) are plentiful. Specifically, we address how many examples from the abundant class should be used when training a classifier on data in which one class is very rare. Nearly one million classifiers were built and evaluated to produce the results presented in this work. Our results demonstrate that the often-used even (balanced) class distribution is not optimal when dealing with such rare events.
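
As a concrete illustration of the sampling question studied here, the sketch below randomly undersamples the abundant class so that the minority class makes up a chosen fraction of the training set, then trains and evaluates a classifier at several such fractions, including the even 50:50 split. The decision tree learner, the synthetic data, and the specific fractions are illustrative assumptions, not the paper's experimental design.

```python
# Minimal sketch (assumptions noted above): undersample the majority class to a
# target minority-class fraction, then train and score a classifier.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

def undersample_majority(X, y, minority_fraction, rng):
    """Keep all minority (y == 1) examples; randomly drop majority (y == 0)
    examples until the minority class is `minority_fraction` of the data."""
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    # Number of majority examples implied by the desired class distribution.
    n_majority = int(len(minority_idx) * (1 - minority_fraction) / minority_fraction)
    n_majority = min(n_majority, len(majority_idx))
    keep = rng.choice(majority_idx, size=n_majority, replace=False)
    idx = np.concatenate([minority_idx, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]

rng = np.random.default_rng(0)
# Synthetic imbalanced data with roughly 1% positives (for illustration only).
X = rng.normal(size=(20000, 10))
y = (rng.random(20000) < 0.01).astype(int)
X[y == 1] += 0.75  # make the rare class partially separable

X_train, X_test = X[:15000], X[15000:]
y_train, y_test = y[:15000], y[15000:]

# Compare several training-set class distributions, including the even split.
for frac in (0.05, 0.10, 0.25, 0.50):
    Xr, yr = undersample_majority(X_train, y_train, frac, rng)
    clf = DecisionTreeClassifier(random_state=0).fit(Xr, yr)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"minority fraction {frac:.2f}: test AUC = {auc:.3f}")
```

In this kind of sweep, the best-performing training distribution can be compared against the 50:50 baseline, which is the comparison at the heart of the study.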
