Bayesian Simultaneous Edit and Imputation for Multivariate Categorical Data

ABSTRACT In categorical data, it is typically the case that some combinations of variables are theoretically impossible, such as a 3-year-old child who is married or a man who is pregnant. In practice, however, reported values often include such structural zeros due to, for example, respondent mistakes or data processing errors. To purge data of such errors, many statistical organizations use a process known as edit-imputation. The basic idea is first to select reported values to change according to some heuristic or loss function, and second to replace those values with plausible imputations. This two-stage process typically does not fully use information in the data when determining locations of errors, nor does it appropriately reflect uncertainty resulting from the edits and imputations. We present an alternative approach to editing and imputation for categorical microdata with structural zeros that addresses these shortcomings. Specifically, we use a Bayesian hierarchical model that couples a stochastic model for the measurement error process with a Dirichlet process mixture of multinomial distributions for the underlying, error-free values. The latter model is restricted to have support only on the set of theoretically possible combinations. We illustrate this integrated approach to editing and imputation using simulation studies with data from the 2000 U.S. census, and compare it to a two-stage edit-imputation routine. Supplementary material is available online.
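The generative structure described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation: the two toy variables, the single edit rule, the truncation of the stick-breaking prior to a fixed number of components, the flat Dirichlet priors, and the uniform-substitution error mechanism are all assumptions made for illustration. It draws error-free records from a mixture of multinomials restricted (by rejection) to feasible combinations, then passes them through a simple measurement error step.

```python
import random

# Hypothetical toy variables and levels (not from the paper's census data).
LEVELS = {"age": ["child", "adult"], "marital": ["single", "married"]}


def is_feasible(record):
    """Assumed edit rule: a child cannot be married (a structural zero)."""
    return not (record["age"] == "child" and record["marital"] == "married")


def stick_breaking(alpha, n_components, rng):
    """Truncated stick-breaking weights; the last stick takes the remainder."""
    weights, remaining = [], 1.0
    for _ in range(n_components - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights


def draw_component_probs(rng):
    """Per-variable multinomial probabilities from a flat Dirichlet prior,
    generated by normalizing independent Gamma(1, 1) draws."""
    probs = {}
    for var, levels in LEVELS.items():
        gammas = [rng.gammavariate(1.0, 1.0) for _ in levels]
        total = sum(gammas)
        probs[var] = [g / total for g in gammas]
    return probs


def sample_true_record(weights, components, rng):
    """Draw from the mixture and reject infeasible combinations, so the
    resulting distribution has support only on feasible cells."""
    while True:
        k = rng.choices(range(len(weights)), weights=weights)[0]
        record = {
            var: rng.choices(LEVELS[var], weights=components[k][var])[0]
            for var in LEVELS
        }
        if is_feasible(record):
            return record


def add_measurement_error(record, error_rate, rng):
    """Assumed error model: each field is independently in error with
    probability error_rate; an erroneous field is replaced by a different
    level chosen uniformly at random."""
    reported = dict(record)
    for var, levels in LEVELS.items():
        if rng.random() < error_rate:
            reported[var] = rng.choice([l for l in levels if l != record[var]])
    return reported


if __name__ == "__main__":
    rng = random.Random(2024)
    weights = stick_breaking(alpha=1.0, n_components=5, rng=rng)
    components = [draw_component_probs(rng) for _ in weights]
    truth = [sample_true_record(weights, components, rng) for _ in range(10)]
    reported = [add_measurement_error(r, 0.2, rng) for r in truth]
    n_failing = sum(not is_feasible(r) for r in reported)
    print(f"{n_failing} of {len(reported)} reported records fail the edit rule")
```

In this toy setting the error-free records always satisfy the edit rule, while reported records may violate it. A full simultaneous edit-imputation procedure would run this logic in reverse, inferring error locations and true values from the reported data through posterior sampling, which the sketch does not attempt.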
