Latent Patient Cluster Discovery for Robust Future Forecasting and New-Patient Generalization

The use of machine learning and statistical methods to improve healthcare outcomes, commonly referred to as predictive modeling, has recently gained traction in biomedical informatics research. Large Electronic Health Records (EHR) datasets and powerful computing resources open up vast opportunities for predictive modeling, but we argue that it remains crucial to first examine the prediction task carefully and then choose predictive methods accordingly. Specifically, we argue that at least three distinct prediction tasks are often conflated in biomedical research: 1) data imputation, where a model fills in the missing values in a dataset, 2) future forecasting, where a model projects the development of a medical condition for a known patient based on existing observations, and 3) new-patient generalization, where a model transfers the knowledge learned from previously observed patients to newly encountered ones. Importantly, the latter two tasks (future forecasting and new-patient generalization) tend to be more difficult than data imputation because they require predictions on potentially out-of-sample data (i.e., data following a different predictable pattern from what has been learned by the model). Using hearing loss progression as an example, we investigate three regression models and show that modeling latent clusters is a robust method for addressing the more challenging prediction scenarios. Overall, our findings suggest that the different kinds of prediction tasks differ significantly and that a predictive model's merits should be evaluated relative to the specific purpose of the prediction task.
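The distinction between the three tasks can be made concrete by how held-out data are chosen for evaluation. The following is a minimal sketch in Python, not the authors' evaluation code: the toy records, the function names (`make_records`, `imputation_split`, `forecasting_split`, `new_patient_split`), and the split fractions are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

# Hypothetical toy data: (patient_id, visit_index, hearing_threshold_dB).
# The linear-progression-plus-noise shape is an assumption for illustration.
def make_records(n_patients=6, n_visits=5, seed=0):
    rng = random.Random(seed)
    records = []
    for pid in range(n_patients):
        baseline = rng.uniform(10, 40)
        slope = rng.uniform(0.5, 3.0)
        for t in range(n_visits):
            records.append((pid, t, baseline + slope * t + rng.gauss(0, 1)))
    return records

def imputation_split(records, frac=0.2, seed=0):
    """Data imputation: hide observations at random, across patients and time."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * frac))
    return shuffled[n_test:], shuffled[:n_test]

def forecasting_split(records, n_future=1):
    """Future forecasting: hide the last n_future visits of every known patient.

    Assumes 1 <= n_future < number of visits per patient.
    """
    by_patient = defaultdict(list)
    for rec in records:
        by_patient[rec[0]].append(rec)
    train, test = [], []
    for visits in by_patient.values():
        visits.sort(key=lambda r: r[1])
        train.extend(visits[:-n_future])
        test.extend(visits[-n_future:])
    return train, test

def new_patient_split(records, frac=0.2, seed=0):
    """New-patient generalization: hide every observation of some patients."""
    rng = random.Random(seed)
    pids = sorted({r[0] for r in records})
    rng.shuffle(pids)
    n_test = max(1, int(len(pids) * frac))
    held_out = set(pids[:n_test])
    train = [r for r in records if r[0] not in held_out]
    test = [r for r in records if r[0] in held_out]
    return train, test
```

Under the random split, each test point typically has neighboring observations from the same patient in the training set. The other two splits remove that support, either in time or across patients, which is one way to see why they can be harder.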
