Reverse Engineering and Evaluation of Prediction Models for Progression to Type 2 Diabetes

Background: Application of novel machine learning approaches to electronic health record (EHR) data could provide valuable insights into disease processes. We utilized this approach to build predictive models for progression to prediabetes and type 2 diabetes (T2D).

Methods: Using a novel analytical platform (Reverse Engineering and Forward Simulation [REFS]), we built prediction model ensembles for progression to prediabetes or T2D from an aggregated EHR data sample. REFS relies on a Bayesian scoring algorithm to explore a wide model space, and outputs a distribution of risk estimates from an ensemble of prediction models. We retrospectively followed 24 331 adults for transitions to prediabetes or T2D, 2007-2012. Accuracy of prediction models was assessed using an area under the curve (AUC) statistic, and validated in an independent data set.

Results: Our primary ensemble of models accurately predicted progression to T2D (AUC = 0.76), and was validated out of sample (AUC = 0.78). Models of progression to T2D consisted primarily of established risk factors (blood glucose, blood pressure, triglycerides, hypertension, lipid disorders, socioeconomic factors), whereas models of progression to prediabetes included novel factors (high-density lipoprotein, alanine aminotransferase, C-reactive protein, body temperature; AUC = 0.70).

Conclusions: We constructed accurate prediction models from EHR data using a hypothesis-free machine learning approach. Identification of established risk factors for T2D serves as proof of concept for this analytical approach, while novel factors selected by REFS represent emerging areas of T2D research. This methodology has potentially valuable downstream applications to personalized medicine and clinical research.
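The abstract reports accuracy as an AUC computed on risk estimates from an ensemble of models. As a minimal sketch of that evaluation step only, the Python below averages per-patient risk across a handful of models and computes the AUC as the probability that a patient who progressed receives a higher risk than one who did not. This is not the REFS platform or the authors' code: the function names, the use of a simple mean to summarize the ensemble, and the toy numbers are all assumptions made for illustration.

```python
from statistics import mean


def ensemble_risk(predictions_per_model):
    """Average each patient's risk across models.

    predictions_per_model: one list of per-patient risks per model
    (hypothetical layout; a simple mean is an assumed summary).
    """
    return [mean(patient_risks) for patient_risks in zip(*predictions_per_model)]


def auc(scores, labels):
    """AUC via pairwise comparison: the fraction of (progressor,
    non-progressor) pairs ranked correctly, with ties counted as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (invented): three models, four patients.
model_outputs = [
    [0.10, 0.40, 0.35, 0.80],
    [0.20, 0.30, 0.45, 0.70],
    [0.15, 0.50, 0.25, 0.90],
]
progressed = [0, 0, 1, 1]  # 1 = progressed to T2D during follow-up

risk = ensemble_risk(model_outputs)
print(auc(risk, progressed))  # 0.75 for this toy data
```

The pairwise form is quadratic in the number of patients, so for a cohort the size of the one described here a rank-based implementation would be the more practical choice; the pairwise version is used only because it makes the definition of the statistic explicit.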
