Informative presence and observation in routine health data: A review of methodology for clinical risk prediction

Abstract

Objective: Informative presence (IP) is the phenomenon whereby the presence or absence of patient data is potentially informative with respect to the patient's health condition, with informative observation (IO) being the longitudinal equivalent. These phenomena predominantly exist within routinely collected healthcare data, in which data collection is driven by the clinical requirements of patients and clinicians. The extent to which IP and IO are considered when using such data to develop clinical prediction models (CPMs) is unknown, as is the existing methodology that aims to handle these issues. This review aims to synthesize such existing methodology, thereby helping identify an agenda for future methodological work.

Materials and Methods: A systematic literature search was conducted by 2 independent reviewers using prespecified keywords.

Results: Thirty-six articles were included. We categorized the methods presented within them as derived predictors (including some representation of the measurement process as a predictor in the model), modeling under IP, and latent structures. Including missing indicators or summary measures as predictors is the most commonly presented approach among the included studies (24 of 36 articles).

Discussion: This is the first review to collate the literature in this area under a prediction framework. A considerable body of relevant literature exists, and we present ways in which the described methods could be developed further. Guidance is required on the conditions under which each method should be used, so that applied prediction modelers can use these methods.

Conclusions: A growing recognition of IP and IO exists within the literature, and methodology is increasingly becoming available to leverage these phenomena for prediction purposes. IP and IO should be approached differently in a prediction context than when the primary goal is explanation. The work included in this review has demonstrated theoretical and empirical benefits of incorporating IP and IO, and therefore we recommend that applied health researchers consider incorporating these methods in their work.
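To make the derived-predictor category concrete, the following is a minimal sketch of how a missing indicator and a simple summary measure of the measurement process might be computed for one patient and one variable. It is an illustration under stated assumptions, not an implementation from any of the reviewed articles: the function name `presence_features`, the `(time, value)` record layout, and the `fill_value` default are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (time of measurement in days, measured value).
Measurement = Tuple[float, float]


def presence_features(
    measurements: List[Measurement], fill_value: float = 0.0
) -> Dict[str, float]:
    """Derive predictors that represent the measurement process.

    Returns a missing indicator (1.0 if the variable was never measured),
    the number of measurements as a summary measure, and the most recent
    value (or an assumed placeholder `fill_value` if never measured).
    """
    n = len(measurements)
    if n == 0:
        # Never measured: the absence itself is passed to the model.
        return {"missing": 1.0, "n_measurements": 0.0, "last_value": fill_value}
    # Most recent measurement, selected by its time stamp.
    _, last_value = max(measurements, key=lambda m: m[0])
    return {"missing": 0.0, "n_measurements": float(n), "last_value": last_value}


# Example: one patient with two lactate measurements, one with none.
measured = presence_features([(1.0, 5.2), (3.0, 6.1)])
unmeasured = presence_features([])
```

The resulting dictionary could be supplied as additional columns to any prediction model; whether such predictors are appropriate depends on the conditions discussed in the review.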
