Characterizing and Managing Missing Structured Data in Electronic Health Records: Data Analysis

Background: Missing data is a challenge for all studies; however, this is especially true for electronic health record (EHR)-based analyses. Failure to appropriately consider missing data can lead to biased results. While there has been extensive theoretical work on imputation, and many sophisticated methods are now available, it remains quite challenging for researchers to implement these methods appropriately. Here, we provide detailed procedures for when and how to conduct imputation of EHR laboratory results.

Objective: The objective of this study was to demonstrate how the mechanism of missingness can be assessed, evaluate the performance of a variety of imputation methods, and describe some of the most frequent problems that can be encountered.

Methods: We analyzed clinical laboratory measures from 602,366 patients in the EHR of Geisinger Health System in Pennsylvania, USA. Using these data, we constructed a representative set of complete cases and assessed the performance of 12 different imputation methods for missing data that was simulated based on 4 mechanisms of missingness (missing completely at random, missing not at random, missing at random, and real data modelling).

Results: Our results showed that several methods, including variations of Multivariate Imputation by Chained Equations (MICE) and softImpute, consistently imputed missing values with low error; however, only a subset of the MICE methods was suitable for multiple imputation.

Conclusions: The analyses we describe provide an outline of considerations for dealing with missing EHR data, steps that researchers can perform to characterize missingness within their own data, and an evaluation of methods that can be applied to impute clinical data. While the performance of methods may vary between datasets, the process we describe can be generalized to the majority of structured data types that exist in EHRs, and all of our methods and code are publicly available.
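The evaluation idea described in the Methods (start from complete cases, hide values under a chosen missingness mechanism, impute, then score against the hidden truth) can be illustrated with a minimal sketch. This is not the authors' published code: the synthetic data, the function names, the 20% missingness rate, the "hide the lowest values" rule standing in for missing not at random, and the column-mean imputer are all assumptions chosen only to keep the example short.

```python
import numpy as np


def mask_mcar(X, frac, rng):
    """Boolean mask hiding roughly `frac` of entries uniformly at random (MCAR)."""
    return rng.random(X.shape) < frac


def mask_mnar_low(X, frac):
    """Boolean mask hiding the lowest `frac` of values in each column.

    One simple, hypothetical MNAR pattern: whether a value is missing
    depends on the value itself.
    """
    thresholds = np.quantile(X, frac, axis=0)
    return X < thresholds


def impute_column_mean(X_missing):
    """Baseline imputer (an assumption): fill NaNs with the observed column mean."""
    col_means = np.nanmean(X_missing, axis=0)
    return np.where(np.isnan(X_missing), col_means, X_missing)


def masked_rmse(X_true, X_imputed, mask):
    """Root mean squared error computed over the hidden entries only."""
    diff = (X_true - X_imputed)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


rng = np.random.default_rng(0)
# Synthetic stand-in for a complete-case matrix (rows: patients, columns: labs).
X_complete = rng.normal(loc=100.0, scale=15.0, size=(500, 4))

masks = {
    "MCAR": mask_mcar(X_complete, 0.2, rng),
    "MNAR": mask_mnar_low(X_complete, 0.2),
}

results = {}
for name, mask in masks.items():
    X_missing = np.where(mask, np.nan, X_complete)  # hide the selected entries
    X_imputed = impute_column_mean(X_missing)
    results[name] = masked_rmse(X_complete, X_imputed, mask)
    print(f"{name}: RMSE on hidden entries = {results[name]:.2f}")
```

In this toy setup the same imputer scores noticeably worse under the value-dependent mask than under the random one, which is the kind of mechanism-specific difference the simulation design is meant to expose. Swapping `impute_column_mean` for another imputer leaves the rest of the loop unchanged.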
