Comparison of machine-learning algorithms to build a predictive model for detecting undiagnosed diabetes - ELSA-Brasil: accuracy study

CONTEXT AND OBJECTIVE: Type 2 diabetes is a chronic disease associated with a wide range of serious health complications that have a major impact on overall health. The aims here were to develop and validate predictive models for detecting undiagnosed diabetes using data from the Longitudinal Study of Adult Health (ELSA-Brasil) and to compare the performance of different machine-learning algorithms in this task.

DESIGN AND SETTING: Comparison of machine-learning algorithms to develop predictive models using data from ELSA-Brasil.

METHODS: After selecting a subset of 27 candidate variables from the literature, models were built and validated in four sequential steps: (i) parameter tuning with tenfold cross-validation, repeated three times; (ii) automatic variable selection using forward selection, a wrapper strategy with four different machine-learning algorithms and tenfold cross-validation (repeated three times), to evaluate each subset of variables; (iii) error estimation of model parameters with tenfold cross-validation, repeated ten times; and (iv) generalization testing on an independent dataset. The models were created with the following machine-learning algorithms: logistic regression, artificial neural network, naïve Bayes, K-nearest neighbor and random forest.

RESULTS: The best models were created using artificial neural networks and logistic regression. These achieved mean areas under the curve of, respectively, 75.24% and 74.98% in the error estimation step and 74.17% and 74.41% in the generalization testing step.

CONCLUSION: Most of the predictive models produced similar results and demonstrated the feasibility of identifying individuals with the highest probability of having undiagnosed diabetes using easily obtained clinical data.
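The validation steps described in the methods can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes synthetic data in place of ELSA-Brasil, stratified folds (the abstract does not say whether folds were stratified), and a hypothetical class-centroid scorer (`fit_centroid`) standing in for the five algorithms. Because that stand-in has no hyperparameters, step (i) is omitted. All function names, the `min_gain` stopping rule and the data generator are illustrative assumptions.

```python
import random
from statistics import mean

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def repeated_stratified_kfold(y, k, repeats, seed=0):
    """Yield (train_idx, test_idx) pairs; stratification is an assumption."""
    rng = random.Random(seed)
    for _ in range(repeats):
        folds = [[] for _ in range(k)]
        for cls in (0, 1):
            members = [i for i, t in enumerate(y) if t == cls]
            rng.shuffle(members)
            for j, i in enumerate(members):
                folds[j % k].append(i)  # deal each class round-robin
        for fold in folds:
            held = set(fold)
            yield [i for i in range(len(y)) if i not in held], fold

def fit_centroid(X, y, cols):
    """Hypothetical stand-in learner: project onto the difference of class means."""
    pos = [r for r, t in zip(X, y) if t == 1]
    neg = [r for r, t in zip(X, y) if t == 0]
    w = [mean(r[c] for r in pos) - mean(r[c] for r in neg) for c in cols]
    return lambda row: sum(wi * row[c] for wi, c in zip(w, cols))

def cv_auc(X, y, cols, k=10, repeats=3, seed=0):
    """Mean AUC over repeated k-fold cross-validation for one variable subset."""
    scores = []
    for tr, te in repeated_stratified_kfold(y, k, repeats, seed):
        model = fit_centroid([X[i] for i in tr], [y[i] for i in tr], cols)
        scores.append(auc([y[i] for i in te], [model(X[i]) for i in te]))
    return mean(scores)

def forward_selection(X, y, candidates, min_gain=1e-3):
    """Step (ii): greedily add the variable that most improves the CV AUC."""
    selected, best, remaining = [], 0.5, list(candidates)
    while remaining:
        trial = {c: cv_auc(X, y, selected + [c]) for c in remaining}
        c_best = max(trial, key=trial.get)
        if trial[c_best] - best < min_gain:  # assumed stopping rule
            break
        selected.append(c_best)
        remaining.remove(c_best)
        best = trial[c_best]
    return selected, best

def make_data(n, seed):
    """Synthetic data: columns 0 and 1 are informative, column 2 is noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        t = 1 if rng.random() < 0.3 else 0
        X.append([rng.gauss(1.0 * t, 1.0), rng.gauss(0.5 * t, 1.0), rng.gauss(0.0, 1.0)])
        y.append(t)
    return X, y

X_dev, y_dev = make_data(300, seed=1)    # development set
X_test, y_test = make_data(150, seed=2)  # independent set for step (iv)

selected, dev_auc = forward_selection(X_dev, y_dev, candidates=[0, 1, 2])
est_auc = cv_auc(X_dev, y_dev, selected, k=10, repeats=10)  # step (iii)
final_model = fit_centroid(X_dev, y_dev, selected)
test_auc = auc(y_test, [final_model(r) for r in X_test])    # step (iv)
print(selected, round(est_auc, 3), round(test_auc, 3))
```

Comparing `est_auc` with `test_auc` mirrors how the abstract reports separate figures for the error estimation and generalization testing steps. A large gap between the two would suggest that the variable selection overfitted the development data.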
