SMOTEBoost: Improving Prediction of the Minority Class in Boosting

Many real-world data mining applications involve learning from imbalanced data sets. Learning from data sets that contain very few instances of the minority (or interesting) class usually produces biased classifiers that have higher predictive accuracy on the majority class(es) but poorer predictive accuracy on the minority class. SMOTE (Synthetic Minority Over-sampling TEchnique) is specifically designed for learning from imbalanced data sets. This paper presents a novel approach for learning from imbalanced data sets, based on a combination of the SMOTE algorithm and the boosting procedure. Unlike standard boosting, where all misclassified examples are given equal weights, SMOTEBoost creates synthetic examples from the rare or minority class, thus indirectly changing the updating weights and compensating for skewed distributions. Applied to several highly and moderately imbalanced data sets, SMOTEBoost improves prediction performance on the minority class and yields higher overall F-values.
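The synthetic-example step mentioned in the abstract can be illustrated with a minimal sketch of SMOTE-style interpolation: a new minority point is placed at a random position on the line segment between a minority example and one of its nearest minority neighbours. This is an illustration under stated assumptions, not the authors' implementation; the function name `smote_sketch`, its parameters, the Euclidean distance, and the toy data are all hypothetical. In SMOTEBoost, a step of this kind would be combined with the boosting procedure; the boosting loop itself is not shown here.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority points by interpolating toward neighbours.

    minority    : list of numeric feature vectors from the minority class
    n_synthetic : number of synthetic points to generate
    k           : number of nearest minority neighbours to consider
    seed        : seed for the random generator (hypothetical parameter)
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority examples")

    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority example as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]

        # Find its k nearest minority neighbours (Euclidean, excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])

        # Place the new point at a random position between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two numeric features (made-up values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_synthetic=6, k=2)
```

Because every synthetic point lies between two existing minority examples, the minority region is filled in rather than merely replicated, which is the property the abstract relies on when it says the synthetic examples compensate for skewed distributions.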
