Comparative approaches for classification of diabetes mellitus data: Machine learning paradigm

BACKGROUND AND OBJECTIVE Diabetes is a silent killer. The main cause of this disease is the presence of excessive amounts of metabolites such as glucose. In 2014 there were about 387 million people with diabetes worldwide. The financial burden of the disease has been estimated at about $13,700 per year. According to the World Health Organization (WHO), these figures will more than double by 2030. This cost could be reduced dramatically if diabetes could be predicted statistically from a set of covariates. Although several classification techniques are available, classifying diabetes remains very difficult. The main objectives of this paper are: (i) to apply Gaussian process classification (GPC) to diabetes data, (ii) to compare classifiers for diabetes data classification, (iii) to analyze the data using a cross-validation approach, (iv) to interpret the data analysis, and (v) to benchmark our method against others.

METHODS Several techniques are used to classify diabetes, such as linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and Naive Bayes (NB). However, most medical data show non-normality, non-linearity and an inherent correlation structure. In this paper we therefore adapt a Gaussian process (GP)-based classification technique using three kernels: linear, polynomial and radial basis. We also compare the performance of the GP-based classification technique with that of existing techniques such as LDA, QDA and NB. Performance is evaluated using accuracy (ACC), sensitivity (SE), specificity (SP), positive predictive value (PPV), negative predictive value (NPV) and receiver-operating characteristic (ROC) curves.

RESULTS The Pima Indian diabetes dataset is used in this study. It consists of 768 patients, of whom 268 are diabetic and 500 are controls. Our machine learning system gives the following performance for the GP-based model: ACC 81.97%, SE 91.79%, SP 63.33%, PPV 84.91% and NPV 62.50%, which are higher than those of the other methods.
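As an illustration of the evaluation measures named in METHODS, the following minimal sketch computes ACC, SE, SP, PPV and NPV from binary labels using the standard confusion-matrix definitions. It is not the authors' code: the names `ConfusionCounts`, `confusion_counts` and `classification_metrics` are hypothetical, the choice of label `1` as the positive class is an assumption, and the toy labels are invented for the example rather than taken from the Pima Indian data. ROC curves are not covered here.

```python
from dataclasses import dataclass
from math import nan


@dataclass(frozen=True)
class ConfusionCounts:
    """Cell counts of a 2x2 confusion matrix (hypothetical helper type)."""
    tp: int  # positive cases predicted positive
    fp: int  # negative cases predicted positive
    tn: int  # negative cases predicted negative
    fn: int  # positive cases predicted negative


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN; `positive` marks the positive class (assumed 1)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def _ratio(numerator, denominator):
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else nan


def classification_metrics(c):
    """Return the five scalar measures as fractions in [0, 1]."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "ACC": _ratio(c.tp + c.tn, total),   # accuracy
        "SE": _ratio(c.tp, c.tp + c.fn),     # sensitivity (recall)
        "SP": _ratio(c.tn, c.tn + c.fp),     # specificity
        "PPV": _ratio(c.tp, c.tp + c.fp),    # positive predictive value
        "NPV": _ratio(c.tn, c.tn + c.fn),    # negative predictive value
    }


# Toy labels invented for this example (not from the paper's dataset).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(counts)

if __name__ == "__main__":
    for name, value in metrics.items():
        print(f"{name}: {value:.2%}")
```

In a cross-validation setting such as the one described above, these measures would typically be computed on each held-out fold and then averaged; whether the reported figures were obtained in exactly that way is not stated in this abstract.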
