A fair comparison of tree-based and parametric methods in multiple imputation by chained equations

Multiple imputation by chained equations (MICE) has emerged as a leading strategy for imputing missing epidemiological data because it is easy to implement and can preserve unbiased effect estimates and valid inference. Within the MICE algorithm, each incomplete variable can be imputed using a variety of parametric or nonparametric methods. Previous studies have suggested that nonparametric tree-based imputation methods outperform parametric methods in terms of bias and coverage when there are interactions or other nonlinear effects among the variables. However, these studies do not provide a fair comparison, because they do not follow the well-established recommendation that any effects in the final analysis model (including interactions) should be included in the parametric imputation model. We show via simulation that properly incorporating interactions in the parametric imputation model substantially improves its performance. In fact, correctly specified parametric imputation and tree-based random forest imputation perform similarly when estimating the interaction effect. Parametric imputation yields slightly higher coverage for the interaction effect, but it produces wider confidence intervals than random forest imputation and requires correct specification of the imputation model. Epidemiologists should take care in specifying MICE imputation models, and this paper assists in that task by providing a fair comparison of parametric and tree-based imputation in MICE.
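
The recommendation that the imputation model contain every effect in the analysis model can be illustrated with a small toy simulation. The sketch below is not the paper's simulation design; it assumes a deliberately simple setting in which only the outcome `y` is missing completely at random, the analysis model is `y ~ x1 + x2 + x1*x2` with a true interaction coefficient of 1, and imputation is a bootstrap-based stochastic linear regression written directly in NumPy rather than a call to any MICE package. All names (`design`, `impute_outcome`, `pooled_interaction`) and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def design(x1, x2, interaction):
    """Design matrix: intercept, main effects, and optionally the x1*x2 term."""
    cols = [np.ones_like(x1), x1, x2]
    if interaction:
        cols.append(x1 * x2)
    return np.column_stack(cols)

def impute_outcome(y, x1, x2, interaction, rng):
    """Return a copy of y with NaN entries filled by one stochastic regression draw."""
    obs = ~np.isnan(y)
    X = design(x1, x2, interaction)
    # Bootstrap the observed rows so each imputation reflects parameter uncertainty.
    idx = rng.choice(np.flatnonzero(obs), size=obs.sum(), replace=True)
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    resid = y[idx] - X[idx] @ beta
    sigma = np.sqrt(resid @ resid / (len(idx) - X.shape[1]))
    y_imp = y.copy()
    # Imputed value = predicted mean + normally distributed residual noise.
    y_imp[~obs] = X[~obs] @ beta + rng.normal(0.0, sigma, size=(~obs).sum())
    return y_imp

def pooled_interaction(y, x1, x2, interaction, m=5, seed=0):
    """Average the analysis-model interaction estimate over m imputed data sets."""
    rng = np.random.default_rng(seed)
    X_analysis = design(x1, x2, interaction=True)  # analysis model always has x1*x2
    estimates = []
    for _ in range(m):
        y_imp = impute_outcome(y, x1, x2, interaction, rng)
        beta, *_ = np.linalg.lstsq(X_analysis, y_imp, rcond=None)
        estimates.append(beta[3])
    return float(np.mean(estimates))

# Simulated data (assumed values): true interaction coefficient is 1.0.
rng = np.random.default_rng(42)
n = 5000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + x1 + x2 + 1.0 * x1 * x2 + rng.normal(size=n)
y[rng.random(n) < 0.4] = np.nan  # about 40% of outcomes missing completely at random

est_without = pooled_interaction(y, x1, x2, interaction=False)
est_with = pooled_interaction(y, x1, x2, interaction=True)
print(f"imputation model without interaction: {est_without:.2f}")
print(f"imputation model with interaction:    {est_with:.2f}")
```

In this toy setting, omitting the interaction from the imputation model pulls the pooled interaction estimate toward zero, roughly in proportion to the fraction of imputed values, while including it recovers an estimate close to the true value. Missing covariates, which are the harder and more common case, need additional care and are not covered by this sketch.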
