Nonparametric Bayesian Multiple Imputation for Incomplete Categorical Variables in Large-Scale Assessment Surveys

In many surveys, the data comprise a large number of categorical variables that suffer from item nonresponse. Standard methods for multiple imputation, such as log-linear models or sequential regression imputation, can fail to capture complex dependencies and can be difficult to implement effectively in high dimensions. We present a fully Bayesian, joint modeling approach to multiple imputation for categorical data based on Dirichlet process mixtures of multinomial distributions. The approach automatically models complex dependencies while remaining computationally expedient. The Dirichlet process prior distributions enable analysts to avoid fixing the number of mixture components at an arbitrary value. We illustrate the repeated sampling properties of the approach using simulated data, and we apply the methodology to impute missing background data in the 2007 Trends in International Mathematics and Science Study.
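To make the modeling idea concrete, the following is a minimal sketch of a Gibbs sampler for a Dirichlet process mixture of products of multinomial distributions. It is not the authors' implementation. It assumes a stick-breaking representation truncated at `H` components, a fixed concentration parameter `alpha`, uniform Dirichlet priors on the class-specific category probabilities, and missing entries coded as `-1`. The function name `dpmpm_impute` and all of its parameters are hypothetical choices made for illustration.

```python
import numpy as np

def dpmpm_impute(X, n_levels, H=10, n_iter=200, alpha=1.0, seed=0):
    """Illustrative truncated DP mixture of product multinomials.

    X        : (n, p) integer array, categories 0..d_j-1, missing coded -1.
    n_levels : list with the number of categories d_j of each variable.
    Returns one completed copy of X from the final Gibbs iteration.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=int)
    n, p = X.shape
    miss = X < 0
    Xc = X.copy()
    # Start missing entries at uniformly drawn categories.
    for j in range(p):
        Xc[miss[:, j], j] = rng.integers(n_levels[j], size=miss[:, j].sum())
    z = rng.integers(H, size=n)  # initial latent class labels

    for _ in range(n_iter):
        # 1. Stick-breaking weights given the current class sizes.
        counts = np.bincount(z, minlength=H)
        tail = counts[::-1].cumsum()[::-1] - counts  # members of classes > h
        V = rng.beta(1.0 + counts, alpha + tail)
        V[-1] = 1.0  # truncation: the last stick takes the remainder
        pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))

        # 2. Class-specific category probabilities, Dirichlet(1,...,1) prior.
        log_post = np.tile(np.log(pi + 1e-300), (n, 1))
        phi = []
        for j in range(p):
            tab = np.zeros((H, n_levels[j]))
            np.add.at(tab, (z, Xc[:, j]), 1.0)  # class-by-category counts
            g = rng.gamma(tab + 1.0)            # normalised gammas = Dirichlet
            phi_j = g / g.sum(axis=1, keepdims=True)
            phi.append(phi_j)
            ll = np.log(phi_j[:, Xc[:, j]]).T   # shape (n, H)
            ll[miss[:, j]] = 0.0                # only observed items inform z
            log_post += ll

        # 3. Class memberships by inverse-CDF sampling.
        prob = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        prob /= prob.sum(axis=1, keepdims=True)
        u = rng.random(n)
        z = np.minimum((prob.cumsum(axis=1) < u[:, None]).sum(axis=1), H - 1)

        # 4. Impute each missing item from its class's multinomial.
        for j in range(p):
            rows = np.where(miss[:, j])[0]
            if rows.size:
                cdf = phi[j][z[rows]].cumsum(axis=1)
                draw = (cdf < rng.random(rows.size)[:, None]).sum(axis=1)
                Xc[rows, j] = np.minimum(draw, n_levels[j] - 1)
    return Xc
```

In this sketch, a set of multiply imputed datasets could be approximated by keeping several completed copies from well-separated iterations of one long chain, or by rerunning the function with different seeds. A practical analysis would also check that the truncation level `H` is large enough and would typically place a prior on `alpha` rather than fixing it.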
