Sampling Approaches for Imbalanced Data Classification Problem in Machine Learning

Real-world datasets in many domains, such as medicine, intrusion detection, fraud detection, and bioinformatics, are highly imbalanced. In classification problems, imbalanced datasets reduce the accuracy of class predictions. This skew can be handled either by oversampling minority-class examples or by undersampling the majority class. In this work, popular methods from both categories are evaluated for their ability to improve the imbalance ratio of five highly imbalanced datasets from different application domains. The effect of balancing on classification results is also investigated. The adaptive synthetic oversampling approach was observed to give the best improvement in both the imbalance ratio and the classification results. However, undersampling approaches gave better overall performance on all datasets.
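As an illustration of the two families of methods named above, the following minimal sketch shows plain random oversampling and random undersampling on a toy two-class dataset, along with the imbalance ratio (majority count divided by minority count). It is not one of the specific methods evaluated in this work, and it is not the adaptive synthetic approach; the function names `imbalance_ratio`, `random_oversample`, and `random_undersample` and the toy data are assumptions made for illustration only.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen examples of smaller classes until all classes
    match the size of the largest class. (Hypothetical helper.)"""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_x.extend(extra)
        out_y.extend([cls] * len(extra))
    return out_x, out_y


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class.
    (Hypothetical helper.)"""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_x, out_y = [], []
    for cls in counts:
        pool = [x for x, y in zip(samples, labels) if y == cls]
        out_x.extend(rng.sample(pool, target))
        out_y.extend([cls] * target)
    return out_x, out_y


# Toy data (assumed): 95 majority examples (label 0), 5 minority examples (label 1).
X = [[float(i)] for i in range(100)]
y = [1] * 5 + [0] * 95

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)

print(imbalance_ratio(y))        # ratio before resampling
print(imbalance_ratio(y_over))   # ratio after oversampling
print(imbalance_ratio(y_under))  # ratio after undersampling
```

Both helpers bring the imbalance ratio to 1.0, but at different costs: oversampling keeps every original example and adds duplicates, while undersampling discards most of the majority class. Resampling of this kind is normally applied to the training split only, so that evaluation reflects the original class distribution.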
