An Enhanced Machine Learning Framework for Type 2 Diabetes Classification Using Imbalanced Data with Missing Values

Diabetes is one of the most common metabolic diseases and is characterized by high blood sugar. Early diagnosis is challenging because the condition depends on many interrelated factors, so there is a need for critical decision support systems that assist medical practitioners in the diagnosis process. This research proposes a predictive model that achieves high classification accuracy for type 2 diabetes. The study consisted of two parts. First, it investigated handling missing data through three imputation methods: median value imputation, K-nearest neighbor imputation, and iterative imputation. It then validated the effect of each imputation method using linear, tree-based, and ensemble classification algorithms to see how each method affected classification accuracy. Second, an Artificial Neural Network was employed to model the best-performing imputed data, balanced with SMOTETomek so that each class is represented fairly. This approach achieved the best accuracy, 98% on the test data, outperforming accuracies reported in prior studies using the same dataset. The dataset used in this study is limited in terms of gender and population. As future work, the study recommends adopting a larger population sample without geographic boundaries. Additionally, because the developed Artificial Neural Network model did not undergo any specific hyperparameter tuning, it would be interesting to explore tuning on normalized data to improve accuracy further.
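
The first part of the study, comparing imputation methods by their downstream classification accuracy, can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic (generated with scikit-learn's `make_classification`, with roughly 10% of values removed at random), the helper name `compare_imputers` is hypothetical, and logistic regression stands in for the linear, tree-based, and ensemble classifiers mentioned above. The SMOTETomek balancing step and the Artificial Neural Network are deliberately left out.

```python
import numpy as np
from sklearn.datasets import make_classification
# IterativeImputer is experimental in scikit-learn; this import enables it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_imputers(X, y, cv=5):
    """Return mean cross-validated accuracy for each imputation method.

    Hypothetical helper for illustration only. The imputer is placed inside
    the pipeline so it is fitted on the training folds alone.
    """
    imputers = {
        "median": SimpleImputer(strategy="median"),
        "knn": KNNImputer(n_neighbors=5),
        "iterative": IterativeImputer(max_iter=10, random_state=0),
    }
    scores = {}
    for name, imputer in imputers.items():
        model = make_pipeline(
            imputer, StandardScaler(), LogisticRegression(max_iter=1000)
        )
        fold_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        scores[name] = float(fold_scores.mean())
    return scores


# Synthetic, mildly imbalanced data (assumption: not the study's dataset).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=0,
)

# Remove about 10% of the entries at random to simulate missing values.
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.10] = np.nan

scores = compare_imputers(X_missing, y)
for name, accuracy in scores.items():
    print(f"{name:>9}: mean CV accuracy = {accuracy:.3f}")
```

Fitting the imputer inside the pipeline keeps statistics from the held-out fold out of the imputation, which matters when the comparison is meant to reflect test-time accuracy. In a fuller version of this sketch, the class balancing described above would be applied to the training split only, after imputation and before fitting the final model.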
