SMOTE: Synthetic Minority Over-sampling Technique

An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominantly composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
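The core idea of the over-sampling method described above, creating synthetic minority examples by interpolating between a minority sample and one of its nearest minority-class neighbors, can be sketched as follows. This is a simplified illustration of the technique, not the paper's implementation; the function name `smote` and its parameters are chosen here for clarity.

```python
import numpy as np

def smote(X_min, n_synthetic, k=5, rng=None):
    """Generate synthetic minority-class samples.

    For each synthetic point: pick a random minority sample, pick one of
    its k nearest minority-class neighbors, and interpolate a random
    fraction of the way along the line segment joining them.
    """
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbors

    synth = np.empty((n_synthetic, X_min.shape[1]))
    for i in range(n_synthetic):
        j = rng.integers(n)                    # a random minority sample
        nbr = X_min[rng.choice(nn[j])]         # one of its k neighbors
        gap = rng.random()                     # interpolation factor in [0, 1)
        synth[i] = X_min[j] + gap * (nbr - X_min[j])
    return synth
```

Because each synthetic point lies on a segment between two existing minority samples, the new examples fill in the minority region of feature space rather than merely replicating existing points, which is what distinguishes this approach from over-sampling with replacement.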
