HCBST: An Efficient Hybrid Sampling Technique for Class Imbalance Problems

The class imbalance problem is prevalent in many real-world domains and has become an active area of research. In binary classification, imbalanced learning refers to learning from a dataset that is heavily skewed towards the negative class. This skew causes classification algorithms to perform poorly when predicting the positive class on new examples. Data resampling, which manipulates the training data before applying standard classification techniques, is among the most commonly used approaches for dealing with class imbalance. This article presents a new hybrid sampling technique that significantly improves the overall performance of classification algorithms on the class imbalance problem. The proposed method, called the Hybrid Cluster-Based Undersampling Technique (HCBST), combines a cluster undersampling technique to under-sample the majority instances with an oversampling technique derived from Sigma Nearest Oversampling based on Convex Combination (SNOCC) to oversample the minority instances, addressing the class imbalance problem with a high degree of accuracy and reliability. The performance of the proposed algorithm was tested on 11 datasets with varying degrees of imbalance from the National Aeronautics and Space Administration Metric Data Program repository and the University of California Irvine Machine Learning repository. Results were compared across classification algorithms including k-nearest neighbours, support vector machines, decision tree, random forest, neural network, AdaBoost, naïve Bayes, and quadratic discriminant analysis. Test results revealed that, on the same datasets, HCBST performed better, with average performances of 0.73, 0.67, and 0.35 in terms of area under the curve, geometric mean, and Matthews correlation coefficient, respectively, across all the classifiers used in this study. HCBST therefore has the potential to improve performance on the class imbalance problem and, by extension, the various applications that rely on it for a solution.
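
To make the two resampling stages concrete, the sketch below pairs a simple centroid-based cluster undersampler for the majority class with a convex-combination oversampler for the minority class, in the spirit of the cluster undersampling and SNOCC-style oversampling described above. This is a minimal illustrative sketch in Python using NumPy and scikit-learn; the function names, parameters, and the one-instance-per-cluster rule are assumptions for illustration and do not reproduce the exact HCBST procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def cluster_undersample(X_maj, n_clusters=50, random_state=0):
    # Cluster the majority class and keep the single instance closest to
    # each centroid, so n_clusters controls the retained majority size.
    # (Illustrative rule only; HCBST's cluster undersampling may differ.)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X_maj)
    kept = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        if members.size == 0:
            continue
        dist = np.linalg.norm(X_maj[members] - km.cluster_centers_[c], axis=1)
        kept.append(members[np.argmin(dist)])
    return X_maj[kept]

def convex_oversample(X_min, n_new, k=5, random_state=0):
    # Generate synthetic minority points as convex combinations of a seed
    # point and one of its k nearest minority neighbours (SMOTE/SNOCC-style
    # interpolation, not the exact SNOCC-derived step used in the paper).
    if n_new <= 0:
        return np.empty((0, X_min.shape[1]))
    rng = np.random.default_rng(random_state)
    k = min(k, len(X_min) - 1)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = rng.choice(idx[i][1:])   # skip idx[i][0], which is the point itself
        lam = rng.random()           # convex-combination weight in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# Hypothetical usage: balance X_maj / X_min to roughly 50 instances per class.
# X_balanced = np.vstack([cluster_undersample(X_maj, n_clusters=50),
#                         X_min,
#                         convex_oversample(X_min, n_new=50 - len(X_min))])
```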
