A new analytical framework for missing data imputation and classification with uncertainty: Missing data imputation and heart failure readmission prediction

Background: The wide adoption of electronic health record (EHR) systems has provided vast opportunities to advance health care services. However, the prevalence of missing values in EHR systems poses a great challenge to data analysis in support of clinical decision-making. The objective of this study is to develop a new methodological framework that addresses the missing data challenge and provides a reliable tool for predicting hospital readmission among heart failure patients.

Methods: We used a Gaussian Process Latent Variable Model (GPLVM) to impute the missing values. Specifically, a lower-dimensional embedding was learned from a small complete dataset and then used to impute the missing values in the incomplete dataset. The GPLVM-based imputation provides both a mean estimate and the uncertainty associated with that estimate. To incorporate this uncertainty into prediction, a constrained support vector machine (cSVM) was developed to obtain robust predictions. We first sampled multiple datasets from the distributions of input uncertainty and trained a support vector machine on each dataset. An optimal classifier was then identified by selecting the support vectors that maximize the separation margin on a newly sampled dataset and minimize the similarity with the pre-trained support vectors.

Results: The proposed model was derived and validated using the PhysioNet MIMIC-III clinical database. The GPLVM imputation yielded normalized mean absolute errors of 0.11 and 0.12 when 20% and 30% of instances, respectively, contained missing values, and the confidence bounds of the estimates captured 97% of the true values. The cSVM model achieved an average area under the curve (AUC) of 0.68, which improves the prediction accuracy by 7% compared to some existing classifiers.

Conclusions: The proposed method provides accurate imputation of missing values and achieves better prediction performance than existing models that can only handle deterministic inputs.
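The sampling step and the two imputation metrics reported above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, each imputed value is assumed to follow an independent Gaussian with the given mean and standard deviation, the mean absolute error is assumed to be normalized by the range of the true values, and the confidence bounds are assumed to be the mean plus or minus two standard deviations. The abstract does not specify any of these choices.

```python
import random


def sample_datasets(means, stds, n_samples, seed=0):
    """Draw completed copies of a dataset from per-entry Gaussians.

    means[i][j] is the imputed (or observed) value and stds[i][j] its
    standard deviation; observed entries carry a standard deviation of 0
    and are therefore reproduced unchanged. (Assumed setup, not from the paper.)
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(m, s) for m, s in zip(row_m, row_s)]
         for row_m, row_s in zip(means, stds)]
        for _ in range(n_samples)
    ]


def normalized_mae(true_vals, pred_means):
    """Mean absolute error divided by the range of the true values (assumed normalization)."""
    span = max(true_vals) - min(true_vals)
    mae = sum(abs(t - p) for t, p in zip(true_vals, pred_means)) / len(true_vals)
    return mae / span


def coverage(true_vals, pred_means, pred_stds, z=2.0):
    """Fraction of true values inside mean +/- z * std (assumed form of the bounds)."""
    hits = sum(
        abs(t - m) <= z * s
        for t, m, s in zip(true_vals, pred_means, pred_stds)
    )
    return hits / len(true_vals)


if __name__ == "__main__":
    # Toy data: two patients, two features; only one entry was imputed.
    means = [[1.0, 2.0], [3.0, 4.0]]
    stds = [[0.0, 0.5], [0.0, 0.0]]
    for dataset in sample_datasets(means, stds, n_samples=3):
        print(dataset)
    print(normalized_mae([0.0, 10.0], [1.0, 9.0]))
    print(coverage([0.0, 10.0], [1.0, 9.0], [1.0, 0.1]))
```

In a pipeline following the abstract, each sampled dataset would then be used to train one support vector machine before the constrained selection step; that step is omitted here because the abstract does not give its formulation.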
