An Empirical Evaluation of Repetitive Undersampling Techniques

Class imbalance is a fundamental problem in data mining and knowledge discovery, encountered across a wide array of application domains. Random undersampling is widely used to mitigate the harmful effects of imbalance; however, discarding majority-class examples often causes a substantial amount of information loss. Repetitive undersampling techniques, which generate an ensemble of models, each trained on a different undersampled subset of the training data, have been proposed to alleviate this difficulty. This work reviews three repetitive undersampling methods currently used to handle imbalance and presents a detailed and comprehensive empirical study using four different learners, four performance metrics, and 15 datasets from various application domains. To our knowledge, this is the most thorough study of repetitive undersampling techniques to date.
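The core idea described above can be sketched in a few lines. The following is a minimal, hedged illustration of a repetitive undersampling ensemble, not any specific published algorithm: every member model sees all minority examples plus a fresh random undersample of the majority class, and predictions are combined by majority vote. The one-dimensional nearest-centroid "learner" is a hypothetical placeholder for whatever base classifier is actually used.

```python
# Sketch of repetitive (ensemble) undersampling. Assumptions: binary labels
# (1 = minority, 0 = majority), one numeric feature, and a toy nearest-centroid
# base learner standing in for a real classifier.
import random
from collections import Counter

def train_centroid(data):
    """Toy base learner: stores the per-class mean of the single feature."""
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_centroid(model, x):
    """Predict the class whose centroid is nearest to x."""
    return min(model, key=lambda y: abs(x - model[y]))

def repetitive_undersampling_ensemble(data, n_models=11, seed=0):
    """Train n_models learners, each on a balanced, independently
    undersampled subset; return a majority-vote predictor."""
    rng = random.Random(seed)
    minority = [d for d in data if d[1] == 1]
    majority = [d for d in data if d[1] == 0]
    models = []
    for _ in range(n_models):
        # Keep every minority example; undersample the majority class anew.
        subset = minority + rng.sample(majority, len(minority))
        models.append(train_centroid(subset))
    def predict(x):
        votes = Counter(predict_centroid(m, x) for m in models)
        return votes.most_common(1)[0][0]
    return predict

# Usage: a 10:1 imbalanced toy dataset; the rare class is centered near 5.0.
random.seed(42)
data = [(random.gauss(0.0, 1.0), 0) for _ in range(200)] + \
       [(random.gauss(5.0, 1.0), 1) for _ in range(20)]
clf = repetitive_undersampling_ensemble(data)
```

Because each subset drops a different part of the majority class, the ensemble as a whole retains far more of the majority-class information than a single undersampled model would.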
