Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models With Local Dependence

ABSTRACT We present a nonparametric Bayesian joint model for multivariate continuous and categorical variables, with the intention of developing a flexible engine for multiple imputation of missing values. The model fuses Dirichlet process mixtures of multinomial distributions for categorical variables with Dirichlet process mixtures of multivariate normal distributions for continuous variables. We incorporate dependence between the continuous and categorical variables by (1) modeling the means of the normal distributions as component-specific functions of the categorical variables and (2) forming distinct mixture components for the categorical and continuous data with probabilities that are linked via a hierarchical model. This structure allows the model to capture complex dependencies between the categorical and continuous data with minimal tuning by the analyst. We apply the model to impute missing values due to item nonresponse in an evaluation of the redesign of the Survey of Income and Program Participation (SIPP). The goal is to compare estimates from a field test with the new design to estimates from selected individuals from a panel collected under the old design. We show that accounting for the missing data changes some conclusions about the comparability of the distributions in the two datasets. We also perform an extensive repeated sampling simulation using similar data from complete cases in an existing SIPP panel, comparing our proposed model to a default application of multiple imputation by chained equations. Imputations based on the proposed model tend to have better repeated sampling properties than the default application of chained equations in this realistic setting. Supplementary materials for this article are available online.
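The two-level mixture structure described in the abstract can be made concrete by simulating from a generative process of that shape. The following is a minimal sketch under stated assumptions, not the article's specification: the function names, the truncation levels (`k_top`, `k_cat`, `k_con`), the priors, and the diagonal covariance used in place of a full multivariate normal covariance are all illustrative choices. Each record first draws a top-level component, which sets the probabilities of its categorical and continuous components; the categorical values come from component-specific multinomials, and the continuous values come from normals whose means are component-specific linear functions of the dummy-coded categorical values.

```python
import random

def stick_breaking(alpha, k, rng):
    """Truncated stick-breaking weights; the last stick takes the remainder."""
    weights, remaining = [], 1.0
    for _ in range(k - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights

def dirichlet(d, rng):
    """Symmetric Dirichlet(1, ..., 1) draw via normalized gamma variates."""
    g = [rng.gammavariate(1.0, 1.0) for _ in range(d)]
    total = sum(g)
    return [v / total for v in g]

def draw(weights, rng):
    """Draw an index with the given probabilities."""
    return rng.choices(range(len(weights)), weights=weights)[0]

def simulate(n, levels, p, rng, k_top=5, k_cat=4, k_con=4, alpha=1.0):
    """Simulate n records: categorical x (one entry per element of `levels`)
    and continuous y of length p. All hyperparameters are hypothetical."""
    # Top-level weights, plus per-top-level weights that link the
    # categorical and continuous component memberships.
    top_w = stick_breaking(alpha, k_top, rng)
    cat_w = [stick_breaking(alpha, k_cat, rng) for _ in range(k_top)]
    con_w = [stick_breaking(alpha, k_con, rng) for _ in range(k_top)]
    # psi[k][j]: category probabilities of variable j in categorical component k.
    psi = [[dirichlet(d, rng) for d in levels] for _ in range(k_cat)]
    # Regression coefficients: intercept plus dummy codes, per continuous component.
    n_design = 1 + sum(d - 1 for d in levels)
    coef = [[[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n_design)]
            for _ in range(k_con)]
    # Assumption: independent errors (diagonal covariance) to keep the sketch short.
    sd = [[rng.uniform(0.5, 1.5) for _ in range(p)] for _ in range(k_con)]

    rows = []
    for _ in range(n):
        h = draw(top_w, rng)       # top-level component
        zx = draw(cat_w[h], rng)   # categorical component
        zy = draw(con_w[h], rng)   # continuous component
        x = [draw(psi[zx][j], rng) for j in range(len(levels))]
        design = [1.0]
        for xj, d in zip(x, levels):
            design += [1.0 if xj == c else 0.0 for c in range(1, d)]
        y = [rng.gauss(sum(b * coef[zy][r][c] for r, b in enumerate(design)),
                       sd[zy][c])
             for c in range(p)]
        rows.append((x, y))
    return rows

if __name__ == "__main__":
    records = simulate(5, levels=[2, 3], p=2, rng=random.Random(1))
    for x, y in records:
        print(x, [round(v, 2) for v in y])
```

In an imputation setting, a sampler would instead condition on the observed entries of each record and draw the missing ones from the corresponding component-specific conditionals; that step is omitted here.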
