Improving Accuracy of Imbalanced Clinical Data Classification Using Synthetic Minority Over-Sampling Technique

Imbalanced datasets occur in many real-world applications. Resampling is an effective remedy because it produces a balanced class distribution. This study uses the Synthetic Minority Over-sampling Technique (SMOTE), an over-sampling method, to handle imbalanced datasets by increasing the number of minority-class instances. The technique reduces the degree of imbalance by generating new synthetic samples, so that a balanced training dataset replaces the imbalanced one. The balanced datasets were then used to train machine learning algorithms to predict the disease class. In experiments on real-world datasets, an oral cancer dataset and the erythemato-squamous diseases dataset from the UCI Machine Learning Repository, the over-sampling method gave better results in clinical disease classification.
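To make the over-sampling step concrete, the following is a minimal sketch of SMOTE-style interpolation in plain Python. It is not the authors' implementation: the function name `smote`, its parameters, and the toy data are assumptions made for illustration. It also picks the base sample at random for each synthetic point, which is a simplification of the per-sample scheme in the original SMOTE paper.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy data (hypothetical, not from the study): 8 majority vs 3 minority samples.
majority = [(float(x), float(y)) for x in range(4) for y in range(2)]
minority = [(10.0, 10.0), (11.0, 10.5), (10.5, 11.5)]

# Add just enough synthetic samples to match the majority class size.
new_points = smote(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic point is an interpolation between two existing minority samples, the new points stay inside the region already occupied by the minority class rather than duplicating existing records. In practice, over-sampling should be applied to the training split only, so that synthetic samples do not leak into the evaluation data.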
