A combination of clustering-based under-sampling with ensemble methods for solving imbalanced class problem in intelligent systems

Abstract: Many real-world datasets have an imbalanced class distribution, in which the number of samples in the larger (majority) class is much greater than in the smaller (minority) class. To address this problem, various undersampling and oversampling techniques have been proposed. They create a dataset with an equal number of samples in each class by reducing the number of majority-class samples or increasing the number of minority-class samples, respectively. Ensemble classifiers use multiple learning algorithms to improve classification accuracy. Reported results indicate that combining undersampling or oversampling methods with ensemble classifiers can yield models with better performance. The present study proposes a novel clustering-based undersampling method for creating a balanced dataset. The method uses the k-means algorithm to cluster the data, the Mahalanobis distance to measure how far each sample in a cluster lies from the cluster centroid, and a selection method that preserves the pattern of data distribution in each cluster. In experiments on 44 benchmark datasets from the KEEL repository, the proposed approach outperformed seven state-of-the-art approaches.
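The three ingredients named in the abstract (k-means clustering, Mahalanobis distance to the cluster centroid, and a distribution-preserving selection step) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the selection rule, so the code below assumes one plausible choice, namely giving each cluster a quota proportional to its size and picking samples evenly spaced along the distance-sorted order. The function names `kmeans`, `mahalanobis_to_centroid`, and `cluster_undersample`, and the parameters `n_keep` and `k`, are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on rows of X; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance of every sample to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def mahalanobis_to_centroid(C):
    """Mahalanobis distance of each row of C to the mean of C."""
    diff = C - C.mean(axis=0)
    if len(C) > 1:
        cov = np.atleast_2d(np.cov(C, rowvar=False))
    else:
        cov = np.eye(C.shape[1])
    inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates singular covariance
    sq = np.einsum("ij,jk,ik->i", diff, inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))

def cluster_undersample(X_maj, n_keep, k=3, seed=0):
    """Return indices of roughly n_keep majority samples to retain.

    ASSUMPTION (not from the paper): each cluster gets a quota proportional
    to its size, filled by samples evenly spaced in distance-sorted order,
    so near-centroid and peripheral samples are both represented.
    """
    labels = kmeans(X_maj, k, seed=seed)
    keep = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if len(idx) == 0:
            continue
        quota = min(len(idx), max(1, round(n_keep * len(idx) / len(X_maj))))
        order = idx[np.argsort(mahalanobis_to_centroid(X_maj[idx]))]
        picks = np.unique(np.linspace(0, len(order) - 1, quota).round().astype(int))
        keep.extend(order[picks].tolist())
    return np.array(sorted(keep))

# Usage: shrink a synthetic 90-sample majority class toward 30 samples.
rng = np.random.default_rng(1)
X_maj = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in (0.0, 5.0, 10.0)])
kept = cluster_undersample(X_maj, n_keep=30, k=3)
X_balanced_majority = X_maj[kept]
```

Because each cluster's quota is rounded independently, the number of retained samples can differ slightly from `n_keep`. The retained majority samples would then be combined with the untouched minority class before training an ensemble classifier.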
