Foundations of Imbalanced Learning

This chapter provides an overview of the imbalanced learning problem and describes some of the key works in the area. It begins by describing what is meant by imbalanced data and by showing the effects of such data on learning. The chapter then describes the fundamental issues that arise when learning from imbalanced data and categorizes them as problem-definition-level, data-level, or algorithm-level issues. It then explains the methods for addressing these issues, organized using the same three categories. As each method is introduced, the underlying issues that it addresses are highlighted. A summary section reviews the foundational problems with imbalanced data and how the various methods address them.
