Mixed effect machine learning: A framework for predicting longitudinal change in hemoglobin A1c

Accurate and reliable prediction of clinical progression over time has the potential to improve the outcomes of chronic disease. The classical approach to analyzing longitudinal data is to use (generalized) linear mixed-effect models (GLMMs). However, linear parametric models are predicated on assumptions that are often difficult to verify. In contrast, data-driven machine learning methods can derive insight from the raw data without a priori assumptions. The underlying theory of most machine learning algorithms, however, assumes that the data are independent and identically distributed, which makes these algorithms inefficient for supervised learning on longitudinal data. In this study, we formulate an analytic framework that integrates the random-effects structure of GLMMs into non-linear machine learning models capable of exploiting the temporally heterogeneous effects and the sparse, varying-length patient characteristics inherent in longitudinal data. We applied the derived mixed-effect machine learning (MEml) framework to predict longitudinal change in glycemic control, measured by hemoglobin A1c (HbA1c), among adults with well-controlled type 2 diabetes. Results show that MEml was competitive with traditional GLMMs and substantially outperformed standard machine learning models that do not account for random effects. Specifically, the accuracy of MEml in predicting glycemic change 1, 2, 3, and 4 clinical visits in advance was 1.04, 1.08, 1.11, and 1.14 times that of the gradient boosted model, respectively, with similar results for the other methods. To further demonstrate the general applicability of MEml, we performed a series of experiments on real, publicly available data sets and on synthetic data sets to assess accuracy and robustness. These experiments reinforced the superiority of MEml over the other methods. Overall, results from this study highlight the importance of modeling random effects in machine learning approaches based on longitudinal data.
Our MEml method is robust to correlated data, readily accounts for random effects, and predicts change in a longitudinal clinical outcome in real-world clinical settings with high accuracy.
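
The general idea of combining a random-effects structure with a non-linear learner can be illustrated with a minimal sketch. It is not the MEml algorithm itself, whose details the abstract does not give; it is a simplified alternating scheme in the spirit of mixed-effects trees, restricted to a random intercept per patient. The function names (`fit_binned_mean`, `fit_mixed_effect`), the piecewise-constant learner standing in for a gradient boosted model, the fixed variance ratio `shrink`, and the noise-free toy data are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def fit_binned_mean(x, target, n_bins=10):
    """Stand-in non-linear learner (assumption): mean of target per bin of x in [0, 1)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for xi, ti in zip(x, target):
        k = min(int(xi * n_bins), n_bins - 1)
        sums[k] += ti
        counts[k] += 1
    overall = sum(target) / len(target)
    means = {k: sums[k] / counts[k] for k in sums}
    # Bins never seen in training fall back to the overall mean.
    return lambda xi: means.get(min(int(xi * n_bins), n_bins - 1), overall)

def fit_mixed_effect(x, y, subject, shrink=1.0, n_iter=20):
    """Alternate between fitting f on y - b[subject] and updating intercepts b.

    shrink is an assumed, fixed ratio of residual variance to
    random-intercept variance; a full method would estimate it.
    """
    b = {s: 0.0 for s in set(subject)}
    f = None
    for _ in range(n_iter):
        # Step 1: remove the current random intercepts, fit the fixed part.
        adjusted = [yi - b[s] for yi, s in zip(y, subject)]
        f = fit_binned_mean(x, adjusted)
        # Step 2: shrunken per-subject mean of the residuals y - f(x).
        resid_sum, n = defaultdict(float), defaultdict(int)
        for xi, yi, s in zip(x, y, subject):
            resid_sum[s] += yi - f(xi)
            n[s] += 1
        b = {s: resid_sum[s] / (n[s] + shrink) for s in b}
    return f, b

# Toy longitudinal data (hypothetical): 6 patients, 10 visits each,
# a shared non-linear trend plus a patient-specific offset, no noise.
true_b = {0: -2.0, 1: -1.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 2.0}
x, y, subject = [], [], []
for s, offset in true_b.items():
    for k in range(10):
        xi = (k + 0.5) / 10
        x.append(xi)
        y.append(math.sin(2 * math.pi * xi) + offset)
        subject.append(s)

f_mixed, b_hat = fit_mixed_effect(x, y, subject)
f_pooled = fit_binned_mean(x, y)  # ignores the patient grouping

def mse(pred):
    return sum((p - yi) ** 2 for p, yi in zip(pred, y)) / len(y)

mse_mixed = mse([f_mixed(xi) + b_hat[s] for xi, s in zip(x, subject)])
mse_pooled = mse([f_pooled(xi) for xi in x])
print(f"pooled MSE: {mse_pooled:.4f}, mixed-effect MSE: {mse_mixed:.4f}")
```

On this toy data the pooled learner cannot represent the per-patient offsets, so its in-sample error is dominated by them, whereas the alternating fit absorbs most of each offset into the shrunken intercept. For a new patient with no history, the intercept would default to zero and the prediction would reduce to the fixed part alone.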
