Comparison of Statistical Logistic Regression and RandomForest Machine Learning Techniques in Predicting Diabetes

Diabetes is a global healthcare concern and one of the leading health challenges in Saudi Arabia. The prevalence of diabetes is anticipated to rise, and early prediction of individuals at high risk remains a significant challenge. This study compares the RandomForest machine learning algorithm with the Logistic Regression algorithm for predicting diabetes. We analyzed 66,325 records that were extracted from the Ministry of National Guard Hospital Affairs (MNGHA) databases in Saudi Arabia between 2013 and 2015. Both algorithms were applied to predict diabetes based on 18 risk factors. The two algorithms were compared on precision, recall, true positive rate, false positive rate, F-measure, and area under the ROC curve (AUC). The overall prevalence of diabetes in the data set is 64.47%. Males represent 55.50% of the data set and females 44.50%. For the RandomForest (RF) model, the precision, recall, true positive rate, false positive rate, and F-measure for predicting diabetes were 0.883, 0.88, 0.88, 0.188, and 0.876, respectively, whereas for the Logistic Regression model they were only 0.692, 0.703, 0.703, 0.454, and 0.675, respectively. The AUC was 0.944 for the RF model and 0.708 for the Logistic Regression model, which indicates higher predictive performance for RF. The RF algorithm showed superior prediction performance over the Logistic Regression technique in predicting diabetes based on various
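As a rough illustration of the evaluation criteria named in the abstract, the following minimal sketch computes precision, recall (which equals the true positive rate), false positive rate, F-measure, and AUC for a binary classifier. The function names, the 0.5 threshold, and the eight toy labels and scores are all assumptions made for illustration; they are not the study's data or code. The abstract does not say whether its figures are per-class or averaged across classes, so this sketch simply treats "diabetic" as the positive class.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute the threshold-based metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # identical to the true positive rate
    fpr = fp / (fp + tn) if fp + tn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "fpr": fpr, "f_measure": f_measure}


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the MNGHA records.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold

    print(binary_metrics(*confusion_counts(y_true, y_pred)))
    print("AUC:", auc(y_true, scores))
```

The pairwise AUC here is quadratic in the number of records, which is fine for a sketch but would be replaced by a rank-based computation on a data set of the size described above.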
