Sequential BART for imputation of missing covariates.

Comparative effectiveness research using electronic health records (EHR) typically requires many covariates to adjust for selection and confounding biases. Unfortunately, these covariates are often subject to missingness. Restricting the analysis to cases with complete covariates results in considerable efficiency losses and likely bias. Here, we consider covariates that are missing at random, with a missing data mechanism that may or may not depend on the response. Standard methods for multiple imputation can either fail to capture nonlinear relationships or suffer from incompatibility and uncongeniality issues. We explore a flexible Bayesian nonparametric approach to impute the missing covariates, which involves factoring the joint distribution of the covariates with missingness into a set of sequential conditionals and applying Bayesian additive regression trees (BART) to model each of these univariate conditionals. Using data augmentation, the posterior for each conditional can be sampled simultaneously. We provide details on the computational algorithm and make comparisons to other methods, including parametric sequential imputation and two versions of multiple imputation by chained equations. We illustrate the proposed approach on EHR data from an affiliated tertiary care institution to examine factors related to hyperglycemia.
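
The sequential factorization described above can be written out as a short sketch. The notation is an assumption introduced here for illustration and is not taken from the abstract: x_1, ..., x_p are the covariates with missingness in some chosen order, z collects the fully observed covariates, y is the response (it can be left out of the conditioning set in the setting where the response is not used), and f_j is a sum of m regression trees in the standard BART form for a continuous covariate.

```latex
% Joint distribution of the incomplete covariates as a chain of
% univariate conditionals (notation assumed, see the text above)
p(x_1, \dots, x_p \mid z, y)
  = p(x_1 \mid z, y) \prod_{j=2}^{p} p(x_j \mid x_1, \dots, x_{j-1}, z, y)

% Each univariate conditional for a continuous x_j modeled with a
% sum of m regression trees (standard BART form)
x_j = f_j(x_1, \dots, x_{j-1}, z, y) + \varepsilon_j,
\qquad
f_j(\cdot) = \sum_{k=1}^{m} g(\cdot \,; T_{jk}, M_{jk}),
\qquad
\varepsilon_j \sim N(0, \sigma_j^2)
```

Here T_{jk} denotes the k-th tree structure for covariate j and M_{jk} its terminal-node parameters. Because every factor conditions only on covariates earlier in the ordering, the product defines a valid joint distribution by construction, which is how a sequential specification sidesteps the incompatibility that can arise with separately specified full conditionals.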
