Functional clustering methods for longitudinal data with application to electronic health records

We develop a method to estimate subject-level trajectory functions from longitudinal data. The approach can be used for patient phenotyping, feature extraction, or, as in our motivating example, outcome identification, that is, identifying disease status from patient laboratory tests rather than from diagnosis codes or prescription information. We model the joint distribution of a continuous longitudinal outcome and baseline covariates using an enriched Dirichlet process prior. This joint model decomposes into (local) semiparametric linear mixed models for the outcome given the covariates and simple (local) marginals for the covariates. The nonparametric enriched Dirichlet process prior is placed on the regression and spline coefficients, the error variance, and the parameters governing the predictor space. This leads to clustering of patients based on both their outcomes and their covariates. We predict the outcome at unobserved time points for subjects with data at other time points, as well as for new subjects with only baseline covariates. We find improved prediction over mixed models with Dirichlet process priors when the number of covariates is large. We demonstrate the method on electronic health records from initiators of second-generation antipsychotic medications, which are known to increase the risk of diabetes. We use our model to predict laboratory values indicative of diabetes for each individual and assess the incidence of suspected diabetes from the predicted dataset.
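
To make the clustering structure concrete, the following is a minimal sketch of how an enriched Dirichlet process prior induces a nested partition: each subject first joins an outcome-level cluster, then a covariate-level subcluster inside it. It simulates the prior partition only and is not the posterior sampler used in the paper. The function names, the concentration parameters `alpha_y` and `alpha_x`, and the choice of a single shared `alpha_x` for all outcome clusters are assumptions made for illustration.

```python
import random


def crp_assign(counts, alpha, rng):
    """Chinese-restaurant-style draw: return an existing cluster index with
    probability proportional to its size, or len(counts) (a new cluster)
    with probability proportional to alpha."""
    u = rng.random() * (sum(counts) + alpha)
    acc = 0.0
    for k, c in enumerate(counts):
        acc += c
        if u < acc:
            return k
    return len(counts)


def sample_enriched_partition(n, alpha_y, alpha_x, seed=0):
    """Draw nested labels (j, l) for n subjects (hypothetical helper).

    j indexes the outcome-level cluster (shared regression parameters);
    l indexes the covariate subcluster nested inside outcome cluster j.
    """
    rng = random.Random(seed)
    y_counts = []  # size of each outcome-level cluster
    x_counts = []  # x_counts[j]: sizes of covariate subclusters within cluster j
    labels = []
    for _ in range(n):
        # Step 1: choose the outcome-level cluster.
        j = crp_assign(y_counts, alpha_y, rng)
        if j == len(y_counts):
            y_counts.append(0)
            x_counts.append([])
        y_counts[j] += 1
        # Step 2: choose a covariate subcluster, but only among those
        # nested inside the selected outcome-level cluster.
        l = crp_assign(x_counts[j], alpha_x, rng)
        if l == len(x_counts[j]):
            x_counts[j].append(0)
        x_counts[j][l] += 1
        labels.append((j, l))
    return labels


if __name__ == "__main__":
    labels = sample_enriched_partition(n=20, alpha_y=1.0, alpha_x=1.0, seed=1)
    print(labels)
```

Because covariate subclusters are created only within an existing outcome cluster, a large number of covariates can add subclusters without splitting the outcome-level clusters, which is the intuition behind the improved prediction described above.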
