A deep learning-based, unsupervised method to impute missing values in electronic health records for improved patient management

Electronic health records (EHRs) often suffer from missing values, for which recent advances in deep learning offer a promising remedy. We develop a deep learning-based, unsupervised method to impute missing values in patient records, then examine its imputation effectiveness and predictive efficacy for peritonitis patient management. Our method builds on a deep autoencoder framework, incorporates missing patterns, accounts for essential relationships in patient data, considers temporal patterns common to patient records, and employs a novel loss function for error calculation and regularization. Using a data set of 27,327 patient records, we perform a comparative evaluation of the proposed method and several prevalent benchmark techniques. The results indicate that our method outperforms all the benchmark techniques in imputation, recording 5.3%-15.5% lower imputation errors. Furthermore, the data imputed by the proposed method better predict readmission, length of stay, and mortality than the data imputed by any benchmark technique, achieving 2.7%-11.5% improvements in predictive efficacy. The evaluation indicates the proposed method's viability, imputation effectiveness, and clinical decision support utility. Overall, our method can reduce imputation biases and can be applied to a variety of clinical missing-value scenarios, thereby empowering physicians and researchers to better analyze and utilize EHRs for improved patient management.
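
The abstract does not specify the architecture or the novel loss function, so the following is only a minimal sketch of two generic ingredients that autoencoder-based imputation commonly relies on: feeding the missingness pattern to the encoder alongside the values, and computing reconstruction error over observed entries only. All function names here are hypothetical, the reconstruction is supplied by hand rather than produced by a trained network, and the temporal and relational components of the proposed method are not modeled.

```python
import math

def mask_of(record):
    """Return 1.0 for observed entries and 0.0 for missing ones (None)."""
    return [0.0 if v is None else 1.0 for v in record]

def encoder_input(record, fill=0.0):
    """Concatenate filled values with the missingness mask (assumed input layout)."""
    values = [fill if v is None else float(v) for v in record]
    return values + mask_of(record)

def masked_mse(record, reconstruction):
    """Mean squared error over observed entries only; missing ones are skipped."""
    pairs = [(v, r) for v, r in zip(record, reconstruction) if v is not None]
    if not pairs:
        return 0.0
    return sum((v - r) ** 2 for v, r in pairs) / len(pairs)

def impute(record, reconstruction):
    """Keep observed values; fill missing ones from the reconstruction."""
    return [r if v is None else v for v, r in zip(record, reconstruction)]

def imputation_rmse(truth, imputed, record):
    """RMSE restricted to the entries that were missing in `record`."""
    errs = [(t - i) ** 2 for t, i, v in zip(truth, imputed, record) if v is None]
    return math.sqrt(sum(errs) / len(errs)) if errs else 0.0

# Toy example: one record with its second value missing.
truth = [1.0, 2.0, 3.0]
record = [1.0, None, 3.0]
reconstruction = [1.5, 2.5, 3.0]  # stand-in for an autoencoder's output

x_in = encoder_input(record)
loss = masked_mse(record, reconstruction)
filled = impute(record, reconstruction)
rmse = imputation_rmse(truth, filled, record)
```

In an evaluation of the kind described above, values would be deliberately hidden from complete records so that an error such as `imputation_rmse` can be computed against the known ground truth and compared across imputation techniques.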
