A multiple combined method for rebalancing medical data with class imbalances

Most classification algorithms assume that classes are balanced. However, datasets with class imbalances are everywhere. Real medical datasets are typically imbalanced, which severely degrades identification models and can sacrifice classification accuracy on the minority class, even though that class is the most influential and representative. Medical decisions are often irreversible, so the tolerance for misjudgment is low, and errors may cause irreparable harm to patients. Therefore, this study proposes a multiple combined method to rebalance medical data with class imbalances. The combined methods include (1) resampling methods (synthetic minority oversampling technique [SMOTE] and undersampling [US]), (2) particle swarm optimization (PSO), and (3) MetaCost. This study conducted two experiments with nine medical datasets to verify the proposed method and compare it with the listed methods. A decision tree is used to generate decision rules so that the research results are easy to understand. The results show that (1) the proposed method with ensemble learning can improve the area under the receiver operating characteristic curve (AUC), recall, precision, and F1 metrics; (2) MetaCost can increase sensitivity; (3) SMOTE can effectively enhance AUC; (4) US can improve sensitivity, F1, and misclassification costs in data with a high class imbalance ratio; and (5) PSO-based attribute selection can increase sensitivity and reduce data dimensionality. Finally, we suggest that for a dataset with an imbalance ratio > 9, the decision should be based on the US results. When the imbalance ratio is < 9, the decision-maker can consider the results of SMOTE and US together to identify the best decision.
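The closing recommendation can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the imbalance ratio is the majority-class count divided by the minority-class count, and it uses plain random undersampling as a stand-in for US. The function names, the handling of a ratio exactly equal to 9, and the toy labels are all hypothetical.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes")
    return max(counts.values()) / min(counts.values())


def suggested_resampling(ratio, threshold=9.0):
    """Map an imbalance ratio to the resampling results to consult.

    The abstract covers ratio > 9 (use US) and ratio < 9 (consider SMOTE
    and US). A ratio exactly equal to the threshold is not addressed there;
    treating it like the lower case is an assumption of this sketch.
    """
    return ["US"] if ratio > threshold else ["SMOTE", "US"]


def random_undersample(rows, labels, seed=0):
    """Randomly drop samples so every class matches the minority-class size.

    Plain random undersampling is used here for illustration only; the
    study's exact US procedure may differ.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    kept = []
    for cls in counts:
        indices = [i for i, y in enumerate(labels) if y == cls]
        kept.extend(rng.sample(indices, target))
    kept.sort()  # preserve the original sample order
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data: 95 negative and 5 positive cases (hypothetical).
toy_labels = [0] * 95 + [1] * 5
toy_rows = [[i] for i in range(len(toy_labels))]

toy_ratio = imbalance_ratio(toy_labels)
toy_choice = suggested_resampling(toy_ratio)
balanced_rows, balanced_labels = random_undersample(toy_rows, toy_labels)
```

With these toy labels the ratio exceeds 9, so only the US result would be consulted. A milder split, such as 80 to 20, would return both options for side-by-side comparison.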
