Mining with rarity: a unifying framework

Rare objects are often of great interest and great value. Until recently, however, rarity has not received much attention in the context of data mining. Now, as increasingly complex real-world problems are addressed, rarity, and the related problem of imbalanced data, are taking center stage. This article discusses the role that rare classes and rare cases play in data mining. The problems that can result from these two forms of rarity are described in detail, as are methods for addressing these problems. These descriptions draw on examples from existing research, so this article also serves as a survey of the literature on rarity in data mining. Finally, this article demonstrates that rare classes and rare cases are very similar phenomena: both forms of rarity are shown to cause similar problems during data mining and to benefit from the same remediation methods.
