Predicting Diabetes Mellitus With Machine Learning Techniques

Diabetes mellitus is a chronic disease characterized by hyperglycemia that can cause many complications. Given the growing morbidity in recent years, the number of diabetic patients worldwide is projected to reach 642 million by 2040, which means that one in ten adults will have diabetes. This alarming figure demands attention. With its rapid development, machine learning has been applied to many aspects of medical health. In this study, we used decision tree, random forest, and neural network models to predict diabetes mellitus. The dataset consists of hospital physical examination data from Luzhou, China, and contains 14 attributes. Five-fold cross-validation was used to evaluate the models. To verify the general applicability of the methods, we selected the better-performing ones for independent test experiments. We randomly selected the data of 68994 healthy people and 68994 diabetic patients as the training set. Because the data were unbalanced, we repeated the random extraction five times, and the reported result is the average of these five experiments. We used principal component analysis (PCA) and minimum redundancy maximum relevance (mRMR) to reduce the dimensionality. The results showed that prediction with random forest reached the highest accuracy (ACC = 0.8084) when all the attributes were used.
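
The evaluation protocol described above (repeated random extraction to balance the classes, five-fold cross-validation, and averaging over five runs) can be sketched roughly as follows. This is a minimal illustration only, not the study's implementation: it assumes that "random extraction" means undersampling the larger class, uses a nearest-centroid classifier as a simple stand-in for the decision tree, random forest, and neural network models, and runs on synthetic two-feature data rather than the Luzhou dataset. All function names are hypothetical.

```python
import random
from statistics import mean


def k_fold_indices(n, k, rng):
    """Shuffle range(n) and split it into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def balanced_subsample(X, y, rng):
    """Assumed balancing step: undersample so both classes have equal size."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    m = min(len(pos), len(neg))
    keep = rng.sample(pos, m) + rng.sample(neg, m)
    return [X[i] for i in keep], [y[i] for i in keep]


def centroid_fit(X, y):
    """Stand-in classifier (NOT a random forest): per-class mean vectors."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def centroid_predict(centroids, x):
    """Return the label whose centroid is closest in squared distance."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(centroids, key=dist)


def cross_val_accuracy(X, y, k, rng):
    """Mean accuracy over k folds, training on the other k - 1 folds."""
    accs = []
    for fold in k_fold_indices(len(X), k, rng):
        held_out = set(fold)
        X_train = [X[i] for i in range(len(X)) if i not in held_out]
        y_train = [y[i] for i in range(len(X)) if i not in held_out]
        model = centroid_fit(X_train, y_train)
        correct = sum(centroid_predict(model, X[i]) == y[i] for i in fold)
        accs.append(correct / len(fold))
    return mean(accs)


def repeated_balanced_cv(X, y, repeats=5, k=5, seed=0):
    """Average k-fold accuracy over several random balanced subsamples."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        X_bal, y_bal = balanced_subsample(X, y, rng)
        scores.append(cross_val_accuracy(X_bal, y_bal, k, rng))
    return mean(scores)


def make_toy_data(n_neg=300, n_pos=60, seed=1):
    """Synthetic, unbalanced two-feature data (purely illustrative)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n_neg)]
    X += [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(n_pos)]
    y = [0] * n_neg + [1] * n_pos
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print(f"mean ACC over 5 balanced runs: {repeated_balanced_cv(X, y):.4f}")
```

Any classifier with a fit and predict step could replace the centroid functions without changing the surrounding resampling and averaging logic.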
