Paper information - An Over-sampling Expert System for Learing from Imbalanced Data Sets

Learning from imbalanced data sets has become an important branch of machine learning. A relatively simple and effective way to address class imbalance is re-sampling, which comprises under-sampling and over-sampling. A representative over-sampling approach is SMOTE (Synthetic Minority Over-sampling Technique). However, when SMOTE is applied to an imbalanced training set, it is not easy to decide the best ratio of minority to majority samples. This paper presents an over-sampling expert system that builds an ensemble of classifiers trained on data sets over-sampled at different rates. The proposed combination method, C-SMOTE, applied to several highly and moderately imbalanced data sets, automatically obtains an optimal SMOTE rate and shows improved prediction accuracy and overall F-measure on the minority class.
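As a rough illustration of the idea described in the abstract, the sketch below over-samples the minority class at several rates with a simplified SMOTE-style interpolation, trains one classifier per rate, and combines their predictions. It is not the paper's C-SMOTE: the abstract does not specify the combination rule, so the function names (`smote`, `knn_predict`, `ensemble_predict`), the 3-nearest-neighbour base classifier, the rates `(1, 2, 4)`, and the unweighted majority vote are all assumptions made for this example.

```python
import math
import random


def smote(minority, rate, k=3, seed=0):
    """Create `rate` synthetic points per minority sample (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # Up to k nearest minority neighbours of x, excluding x itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(rate):
            nb = rng.choice(neighbours)
            gap = rng.random()  # random position on the segment from x to nb
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


def knn_predict(train, query, k=3):
    """Label of `query` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def ensemble_predict(majority, minority, query, rates=(1, 2, 4)):
    """Vote over classifiers trained at different over-sampling rates.

    Assumption: a plain majority vote stands in for the paper's
    combination method, which the abstract does not describe.
    """
    votes = []
    for rate in rates:
        augmented = minority + smote(minority, rate)
        train = [(p, 0) for p in majority] + [(p, 1) for p in augmented]
        votes.append(knn_predict(train, query))
    return max(set(votes), key=votes.count)


# Toy data: label 0 is the majority class, label 1 the minority class.
majority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (0.2, 0.8)]
minority = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
print(ensemble_predict(majority, minority, (5.5, 5.5)))
```

Each synthetic point is a convex combination of two existing minority samples, so over-sampling never leaves the region spanned by the minority class. An odd number of rates avoids tied votes in the two-class case.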
