An Empirical Comparison of Multiple Imputation Methods for Categorical Data

ABSTRACT Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer fills in missing values with draws from predictive models estimated from the observed data, resulting in multiple, completed versions of the database. Researchers have developed a variety of default routines to implement multiple imputation; however, there has been limited research comparing the performance of these methods, particularly for categorical data. We use simulation studies to compare repeated sampling properties of three default multiple imputation methods for categorical data: chained equations using generalized linear models, chained equations using classification and regression trees, and a fully Bayesian joint model based on Dirichlet process mixture models. We base the simulations on categorical data from the American Community Survey. In the circumstances of this study, the results suggest that default chained equations approaches based on generalized linear models are dominated by the default regression tree and Bayesian mixture model approaches. They also suggest competing advantages for the regression tree and Bayesian mixture model approaches, making both reasonable default engines for multiple imputation of categorical data. Supplementary material for this article is available online.
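
The basic mechanics described in the abstract (draw missing values from a predictive model fitted to the observed data, and repeat to obtain several completed databases) can be illustrated with a minimal sketch. It is not one of the three methods compared in the article: it assumes a single categorical variable, a toy multinomial model with a symmetric Dirichlet prior, and `None` as the missing-value marker. The function name `multiply_impute` and its parameters are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def multiply_impute(values, categories, m=5, prior=1.0, seed=0):
    """Return m completed copies of `values`; None marks a missing entry.

    Toy model (an assumption for illustration): the variable is multinomial
    with a symmetric Dirichlet(prior) prior on the category probabilities.
    """
    rng = random.Random(seed)
    # Counts of each category among the observed (non-missing) entries.
    observed = Counter(v for v in values if v is not None)
    completed = []
    for _ in range(m):
        # Draw category probabilities from the Dirichlet posterior by
        # normalizing independent Gamma draws. Redrawing them for every
        # completed data set carries parameter uncertainty into the draws.
        gammas = [rng.gammavariate(observed[c] + prior, 1.0) for c in categories]
        total = sum(gammas)
        probs = [g / total for g in gammas]
        # Keep observed entries; replace each missing entry with a draw
        # from the sampled category probabilities.
        completed.append([
            v if v is not None else rng.choices(categories, weights=probs)[0]
            for v in values
        ])
    return completed


if __name__ == "__main__":
    data = ["a", "b", None, "a", None, "c"]
    for i, filled in enumerate(multiply_impute(data, ["a", "b", "c"], m=3), 1):
        print(i, filled)
```

Each of the `m` completed lists would then be analyzed separately and the results combined. The methods studied in the article replace the toy marginal model with conditional models (generalized linear models or regression trees) or with a joint mixture model over all variables.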
