RSMOTE: improving classification performance over imbalanced medical datasets

Introduction: Medical diagnosis is a crucial step in patient treatment. However, diagnosis is prone to bias when it relies on imbalanced datasets. To overcome this problem, the synthetic minority oversampling technique (SMOTE) was proposed; it generates new synthetic samples at the data level to balance the minority and majority classes. However, the synthetic samples are generated randomly, which causes a class mixture problem that degrades classification performance and biases diagnosis.

Purpose: To overcome the shortcomings of SMOTE, several modified methods have been proposed that generate synthetic samples along the line segment between selected minority samples. Most of these methods adopt one of two policies for selecting the minority samples: borderline region sampling or safe region sampling. However, both suffer from the over-generalisation problem. We propose a modified SMOTE-based resampling method called RSMOTE to alleviate the imbalanced medical dataset problem. We provide an in-depth analysis and verify the performance of RSMOTE on imbalanced medical datasets.

Methods: RSMOTE divides the minority sample domain into four regions (normal, semi-normal, semi-critical, and critical) based on minority sample density analysis. It determines the region of each minority sample globally and applies resampling near a specific group of samples.

Results: Our analysis and experiments verify that generating synthetic samples in regions with high minority sample density improves classification performance, because the risk of class mixture is low. Unlike some safe region methods, RSMOTE decides the region of minority samples on a global basis, thereby removing the over-generalisation problem. Classic and additional evaluation metrics are used to measure the effectiveness of the modified method: Recall, FP Rate, Precision, F-Measure, ROC area, and Average Aggregated Metric. We carried out experiments on various imbalanced medical datasets.

Conclusion: Based on minority sample density analysis, we propose the RSMOTE method, which divides the minority sample domain into four regions. RSMOTE includes four resampling methods, each of which resamples a specific region. In the experiments, resampling in regions with high minority sample density obtained better results, while resampling in regions with lower minority sample density obtained inferior results. We therefore conclude that RSMOTE is a more flexible resampling method for imbalanced medical datasets, capable of generating samples in regions of varying minority sample density.
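The idea of scoring minority samples by density, grouping them into four regions, and then interpolating only from one region can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not define the density measure, the region thresholds, or which region name corresponds to which density level. The sketch assumes that density is the fraction of a sample's k nearest neighbours that belong to the minority class, that the regions are equal-width density bins, and that "normal" denotes the highest density and "critical" the lowest. The function names `minority_density`, `region_of`, and `smote_in_region` are hypothetical. The interpolation step is the standard SMOTE rule of picking a random point on the segment between a seed and one of its minority neighbours.

```python
import math
import random

# Region names come from the abstract; their ordering by density is an assumption.
REGIONS = ("critical", "semi-critical", "semi-normal", "normal")


def minority_density(X, y, minority_label, k=5):
    """Return (index, density) for every minority sample.

    Density here is the fraction of the k nearest neighbours (any class)
    that are minority samples. This measure is an illustrative assumption.
    Requires at least two samples in X.
    """
    scored = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        nearest = sorted((j for j in range(len(X)) if j != i),
                         key=lambda j: math.dist(xi, X[j]))[:k]
        frac = sum(y[j] == minority_label for j in nearest) / len(nearest)
        scored.append((i, frac))
    return scored


def region_of(density):
    """Map a density in [0, 1] to a region using equal-width bins (assumed)."""
    return REGIONS[min(int(density * 4), 3)]


def smote_in_region(X, y, minority_label, region, n_new, k=5, seed=0):
    """Generate n_new synthetic samples using only seeds from one region."""
    rng = random.Random(seed)
    scored = minority_density(X, y, minority_label, k)
    minority = [i for i, _ in scored]
    seeds = [i for i, dens in scored if region_of(dens) == region]
    if not seeds or len(minority) < 2:
        return []
    synthetic = []
    for _ in range(n_new):
        i = rng.choice(seeds)
        # k nearest minority neighbours of the chosen seed
        neighbours = sorted((j for j in minority if j != i),
                            key=lambda j: math.dist(X[i], X[j]))[:k]
        j = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from X[i] to X[j]
        synthetic.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic
```

Under these assumptions, calling `smote_in_region(X, y, minority_label=1, region="normal", n_new=20)` draws seeds only from minority samples whose neighbourhoods are dominated by other minority samples, so the interpolated points stay inside dense minority areas. Passing `region="critical"` instead draws seeds surrounded mostly by majority samples, which is where interpolation is most likely to cross into majority territory.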
