Exploring data sampling techniques for imbalanced classification problems

The class imbalance problem is one of the key challenges in machine learning and data mining. Imbalanced data can result in the sub-optimal performance of classification models. To address the problem, a variety of data sampling methods have been proposed in previous studies. However, there is no universal solution and it is worth to explore which kind of data sampling technique is more effective in balancing class distribution in terms of the type of data and classifier. In this work, we present an experimental study based on a number of real-world data sets obtained from different disciplines. The goal is to investigate different sampling techniques in terms of the effectiveness of increasing the classification performance in imbalanced data sets. In particular, we study ten sampling methods of different types, including random sampling, clusterbased sampling, ensemble sampling and so on. Besides, the C4.5 decision tree algorithm is used to train the base classifiers and the performance is measured by using precision, G-Measure and Cohen's Kappa statistic.

[1]  C Y Wang,et al.  imDC: an ensemble learning method for imbalanced classification with miRNA data. , 2015, Genetics and molecular research : GMR.

[2]  Rosa Maria Valdovinos,et al.  The Imbalanced Training Sample Problem: Under or over Sampling? , 2004, SSPR/SPR.

[3]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[4]  Stan Matwin,et al.  Addressing the Curse of Imbalanced Training Sets: One-Sided Selection , 1997, ICML.

[5]  Foster J. Provost,et al.  Handling Missing Values when Applying Classification Models , 2007, J. Mach. Learn. Res..

[6]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[7]  Gang Wang Asymmetric Random Subspace Method for Imbalanced Credit Risk Evaluation , 2012 .

[8]  Yang Wang,et al.  Cost-sensitive boosting for classification of imbalanced data , 2007, Pattern Recognit..

[9]  Xiao-Ping Zhang,et al.  Advances in Intelligent Computing, International Conference on Intelligent Computing, ICIC 2005, Hefei, China, August 23-26, 2005, Proceedings, Part I , 2005, ICIC.

[10]  Haibo He,et al.  A Ranked Subspace Learning Method for Gene Expression Data Classification , 2007, IC-AI.

[11]  Tom Fawcett,et al.  Adaptive Fraud Detection , 1997, Data Mining and Knowledge Discovery.

[12]  Jacek M. Zurada,et al.  Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance , 2008, Neural Networks.

[13]  M. Mostafizur Rahman,et al.  Cluster Based Under-Sampling for Unbalanced Cardiovascular Data , 2013 .

[14]  Mohammad Zulkernine,et al.  Network Intrusion Detection using Random Forests , 2005, PST.

[15]  Haibo He,et al.  Learning from Imbalanced Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[16]  Pedro M. Domingos MetaCost: a general method for making classifiers cost-sensitive , 1999, KDD '99.

[17]  Taeho Jo,et al.  Class imbalances versus small disjuncts , 2004, SKDD.

[18]  Tim Menzies,et al.  The \{PROMISE\} Repository of Software Engineering Databases. , 2005 .

[19]  Sebastián Ventura,et al.  Weighted Data Gravitation Classification for Standard and Imbalanced Data , 2013, IEEE Transactions on Cybernetics.