Evaluation of Sampling Methods for Learning from Imbalanced Data

The problem of learning from imbalanced data is of critical importance in a large number of application domains and can be a bottleneck in the performance of various conventional learning methods that assume the data distribution to be balanced. The class imbalance problem corresponds to dealing with the situation where one class massively outnumbers the other. The imbalance between majority and minority would lead machine learning to be biased and produce unreliable outcomes if the imbalanced data is used directly. There has been increasing interest in this research area and a number of algorithms have been developed. However, independent evaluation of the algorithms is limited. This paper aims at evaluating the performance of five representative data sampling methods namely SMOTE, ADASYN, BorderlineSMOTE, SMOTETomek and RUSBoost that deal with class imbalance problems. A comparative study is conducted and the performance of each method is critically analysed in terms of assessment metrics.

[1]  Dimitris Kanellopoulos,et al.  Handling imbalanced datasets: A review , 2006 .

[2]  Jorma Laurikkala,et al.  Improving Identification of Difficult Small Classes by Balancing Class Distribution , 2001, AIME.

[3]  Haibo He,et al.  ADASYN: Adaptive synthetic sampling approach for imbalanced learning , 2008, 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).

[4]  Taghi M. Khoshgoftaar,et al.  RUSBoost: A Hybrid Approach to Alleviating Class Imbalance , 2010, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans.

[5]  Stan Matwin,et al.  Addressing the Curse of Imbalanced Training Sets: One-Sided Selection , 1997, ICML.

[6]  Zhi-Hua Zhou,et al.  Exploratory Under-Sampling for Class-Imbalance Learning , 2006, Sixth International Conference on Data Mining (ICDM'06).

[7]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[8]  David A. Cieslak,et al.  Automatically countering imbalance and its empirical relationship to cost , 2008, Data Mining and Knowledge Discovery.

[9]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.

[10]  Gunnar Rätsch,et al.  Soft Margins for AdaBoost , 2001, Machine Learning.

[11]  Taeho Jo,et al.  A Multiple Resampling Method for Learning from Imbalanced Data Sets , 2004, Comput. Intell..

[12]  Haibo He,et al.  Learning from Imbalanced Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[13]  Nitesh V. Chawla,et al.  SMOTEBoost: Improving Prediction of the Minority Class in Boosting , 2003, PKDD.

[14]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[15]  Foster Provost,et al.  The effect of class distribution on classifier learning: an empirical study , 2001 .

[16]  Hendrik Blockeel,et al.  Knowledge Discovery in Databases: PKDD 2003 , 2003, Lecture Notes in Computer Science.