Predicting Missing Values in Medical Data Via XGBoost Regression

Purpose: The data in a patient's laboratory test results are a notable resource for supporting clinical investigation and enhancing medical research. However, for a variety of reasons, this type of data often contains a non-trivial number of missing values. For example, physicians may neglect to order tests or document the results. Such missingness reduces the degree to which these data can be used to learn efficient and effective predictive models. To address this problem, various approaches have been developed to impute missing laboratory values; however, their performance has been limited. This is due, in part, to the fact that no existing approach effectively leverages the contextual information 1) within individual laboratory test variables or 2) between laboratory test variables.

Method: We introduce an approach that combines an unsupervised prefilling strategy with a supervised machine learning approach, in the form of extreme gradient boosting (XGBoost), to leverage both types of context for imputation purposes. We evaluated the methodology through a series of experiments on approximately 8,200 patients' records in the MIMIC-III dataset.

Result: The results demonstrate that the new model outperforms baseline and state-of-the-art models on 13 commonly collected laboratory test variables. In terms of the normalized root mean square deviation (nRMSD), our model improves imputation by over 20%, on average.

Conclusion: Imputation of missing data in temporal variables can be substantially improved by combining a prefilling strategy with supervised training, which leverages both the longitudinal and cross-sectional context simultaneously.
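The two kinds of context and the error metric named above can be illustrated with a minimal sketch. It is not the authors' implementation. The mean-based prefill, the feature layout (other variables at the same time point plus the target's temporal neighbors), the range-based normalization in `nrmsd`, and all function and variable names are assumptions made for illustration. The supervised regressor itself (XGBoost in the abstract) is omitted; the feature vectors built here are what such a model might be trained on.

```python
from math import sqrt
from statistics import mean


def prefill(series):
    """Unsupervised prefill (assumed strategy): replace None with the observed mean."""
    observed = [v for v in series if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in series]


def build_features(prefilled, target, t):
    """Build one feature vector for variable `target` at time index t.

    Cross-sectional context: the other variables at time t.
    Longitudinal context: the target at t-1 and t+1; at a boundary the
    single available neighbor is used twice (requires at least 2 time points).
    """
    n = len(prefilled[target])
    cross = [prefilled[name][t] for name in sorted(prefilled) if name != target]
    prev_idx = t - 1 if t > 0 else t + 1
    next_idx = t + 1 if t < n - 1 else t - 1
    return cross + [prefilled[target][prev_idx], prefilled[target][next_idx]]


def nrmsd(truth, imputed):
    """RMSD divided by the range of the true values (assumed normalization)."""
    rmsd = sqrt(mean((a - b) ** 2 for a, b in zip(truth, imputed)))
    return rmsd / (max(truth) - min(truth))


# Hypothetical toy record: two lab variables over three time points.
labs = {"glucose": [90.0, None, 110.0], "sodium": [140.0, 138.0, None]}
prefilled = {name: prefill(values) for name, values in labs.items()}

features = build_features(prefilled, "glucose", 1)
error = nrmsd([90.0, 100.0, 110.0], [90.0, 105.0, 110.0])
```

In a full pipeline, feature vectors for the observed entries would serve as training rows with the observed value as the label, and the fitted regressor would then predict the entries that were originally missing.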
