Multiple imputation for missing data via sequential regression trees.

Multiple imputation is particularly well suited to handling missing data in large epidemiologic studies, because these studies typically support a wide range of analyses by many data users. Some of these analyses may involve complex modeling, including interactions and nonlinear relations. Identifying such relations and encoding them in imputation models, for example, in the conditional regressions for multiple imputation via chained equations, can be daunting when there are large numbers of categorical and continuous variables. The authors present a nonparametric approach for implementing multiple imputation via chained equations by using sequential regression trees as the conditional models. This approach has the potential to capture complex relations with minimal tuning by the data imputer. Using simulations, the authors demonstrate that the method can result in more plausible imputations, and hence more reliable inferences, in complex settings than the naive application of standard sequential regression imputation techniques. They apply the approach to impute missing values in data on adverse birth outcomes with more than 100 clinical and survey variables. They evaluate the imputations using posterior predictive checks with several epidemiologic analyses of interest.
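
To make the idea of chained equations with regression trees as the conditional models concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions and not the authors' implementation: all function names and the parameters `min_leaf`, `max_depth`, and `n_iter` are hypothetical, all variables are treated as continuous, every column is assumed to have at least one observed value, and each missing value is filled by drawing at random from the observed values in the tree leaf that the record falls into. The published method may differ in how trees are grown and how values are drawn within leaves.

```python
import random


def _sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(X, y, min_leaf=5, max_depth=4, depth=0):
    """Grow a small regression tree; each leaf keeps its observed y values as donors."""
    n = len(y)
    best = None
    if depth < max_depth and n >= 2 * min_leaf:
        for j in range(len(X[0])):
            # Candidate cut points: every distinct value except the largest.
            for cut in sorted({row[j] for row in X})[:-1]:
                left = [y[i] for i in range(n) if X[i][j] <= cut]
                right = [y[i] for i in range(n) if X[i][j] > cut]
                if min(len(left), len(right)) < min_leaf:
                    continue
                score = _sse(left) + _sse(right)
                if best is None or score < best[0]:
                    best = (score, j, cut)
    if best is None:
        return {"donors": list(y)}
    _, j, cut = best
    lo = [i for i in range(n) if X[i][j] <= cut]
    hi = [i for i in range(n) if X[i][j] > cut]
    return {
        "var": j,
        "cut": cut,
        "left": fit_tree([X[i] for i in lo], [y[i] for i in lo], min_leaf, max_depth, depth + 1),
        "right": fit_tree([X[i] for i in hi], [y[i] for i in hi], min_leaf, max_depth, depth + 1),
    }


def leaf_donors(tree, x):
    """Follow the splits for record x and return the donors stored in its leaf."""
    while "donors" not in tree:
        tree = tree["left"] if x[tree["var"]] <= tree["cut"] else tree["right"]
    return tree["donors"]


def cart_mice(data, n_iter=5, rng=None):
    """Return one completed copy of data (rows of floats, None marks a missing value)."""
    rng = rng or random.Random()
    n, p = len(data), len(data[0])
    miss = [[v is None for v in row] for row in data]
    done = [list(row) for row in data]
    # Starting values: a random observed value from the same column.
    for j in range(p):
        observed = [row[j] for row in data if row[j] is not None]
        for i in range(n):
            if miss[i][j]:
                done[i][j] = rng.choice(observed)
    # Chained equations: cycle over the variables, refitting a tree each time.
    for _ in range(n_iter):
        for j in range(p):
            if not any(m[j] for m in miss):
                continue
            others = [k for k in range(p) if k != j]
            obs = [i for i in range(n) if not miss[i][j]]
            tree = fit_tree([[done[i][k] for k in others] for i in obs],
                            [done[i][j] for i in obs])
            for i in range(n):
                if miss[i][j]:
                    donors = leaf_donors(tree, [done[i][k] for k in others])
                    done[i][j] = rng.choice(donors)
    return done


if __name__ == "__main__":
    gen = random.Random(0)
    rows = []
    for i in range(100):
        x = gen.uniform(-2, 2)
        y = x * x + gen.gauss(0, 0.1)  # nonlinear relation a linear model would miss
        rows.append([x, None if i % 4 == 0 else y])
    # Multiple imputation: repeat the procedure with different random seeds.
    completed = [cart_mice(rows, rng=random.Random(seed)) for seed in range(5)]
    print(len(completed), "completed data sets")
```

Because imputed values are drawn from observed donors in the matching leaf, they always lie within the observed range of the variable, and no interaction or nonlinear term has to be specified by hand. Each completed data set would then be analyzed separately and the results combined with the usual multiple imputation combining rules.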
