Effect of Data Scaling Methods on Machine Learning Algorithms and Model Performance

Heart disease, one of the leading causes of death worldwide, requires a sophisticated and expensive diagnostic process. Much recent literature has shown that machine learning approaches offer an opportunity to diagnose heart disease patients efficiently. However, dataset problems such as missing data, inconsistent data, and mixed data (numerical and categorical values with inconsistent or missing entries) are frequent obstacles in medical diagnosis. These inconsistencies raise the probability of misprediction and misleading results. Data preprocessing steps such as feature reduction, data conversion, and data scaling are used to form a standard dataset, and these measures play a crucial role in reducing inaccuracy in the final prediction. This paper evaluates eleven machine learning (ML) algorithms, namely Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Naive Bayes (NB), Support Vector Machine (SVM), XGBoost (XGB), Random Forest Classifier (RF), Gradient Boosting (GB), AdaBoost (AB), and Extra Trees Classifier (ET), together with six data scaling methods, namely Normalization (NR), Standard Scaler (SS), MinMax (MM), MaxAbs (MA), Robust Scaler (RS), and Quantile Transformer (QT), on a dataset comprising information on patients with heart disease. The results show that CART, combined with RS or QT, outperforms all other ML algorithms with 100% accuracy, 100% precision, 99% recall, and 100% F1 score. The study outcomes demonstrate that a model's performance varies depending on the data scaling method.
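To make the scaling methods concrete, the following is a minimal sketch of four of them (MM, SS, MA, RS) written as column-wise formulas in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names and the sample column `chol` are hypothetical, each column is assumed to be non-constant, and the quartiles use linear interpolation. NR and QT are omitted for brevity.

```python
from statistics import mean, pstdev, quantiles

# Hypothetical feature column (illustrative values only, not from the paper's dataset).
chol = [100, 200, 300, 400, 500]

def min_max(col):
    """MinMax (MM): map the column linearly onto [0, 1]."""
    lo, hi = min(col), max(col)
    return [(x - lo) / (hi - lo) for x in col]  # assumes hi != lo

def standard(col):
    """Standard scaling (SS): zero mean, unit (population) standard deviation."""
    mu, sigma = mean(col), pstdev(col)
    return [(x - mu) / sigma for x in col]  # assumes sigma != 0

def max_abs(col):
    """MaxAbs (MA): divide by the largest absolute value, giving [-1, 1]."""
    m = max(abs(x) for x in col)
    return [x / m for x in col]  # assumes m != 0

def robust(col):
    """Robust scaling (RS): center on the median, divide by the interquartile range."""
    q1, q2, q3 = quantiles(col, n=4, method="inclusive")
    return [(x - q2) / (q3 - q1) for x in col]  # assumes q3 != q1

for name, fn in [("MM", min_max), ("SS", standard), ("MA", max_abs), ("RS", robust)]:
    print(name, [round(v, 3) for v in fn(chol)])
```

Because RS uses the median and interquartile range rather than the minimum, maximum, or mean, a single extreme value shifts its output far less than it shifts MM or SS, which is one reason results can differ across scalers.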
