NISS Bayesian Data Editing for Continuous Microdata With Application to the 2007 Census of Manufactures

We present a fully Bayesian, joint modeling approach to simultaneous editing and imputation for continuous microdata under linear constraints. To provide high-quality data, statistical agencies spend substantial resources detecting and correcting erroneous values in collected datasets. Several automatic data editing systems are based on the Fellegi-Holt method, which uses a two-stage process: first, erroneous data items are located using logical conditions, called edits; second, new values are imputed for the detected items, usually with relatively simple methods such as hot deck imputation or ratio imputation. Our approach replaces this two-stage process with a single probability-based, data-driven approach in which we (i) specify a flexible joint probability model for the continuous variables, which can capture complex associations, (ii) stochastically identify erroneous items, which, unlike Fellegi-Holt routines, can reflect the information in individual records, and (iii) impute new values from the model in ways guaranteed to satisfy all ratio edits as well as balance edits, without deriving implied edits. In this paper, we describe this integrated editing approach, together with a simulation study and an application to the 2007 U.S. Census of Manufactures data. We compare the approach against Fellegi-Holt-based approaches, showing how the joint model-based approach can offer improved accuracy.
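
To make the two kinds of linear constraints concrete, the following is a minimal sketch of checking a single record against one ratio edit and one balance edit. It is not the paper's method: it only flags which edits a record fails, and does not localize the erroneous items or impute replacements. The variable names (`wages`, `employees`, `materials`, `total_cost`), the ratio bounds, and the helper `failed_edits` are all hypothetical, chosen purely for illustration.

```python
from math import isclose

# Hypothetical edits, for illustration only.
# Ratio edit: lower <= numerator / denominator <= upper.
RATIO_EDITS = [("wages", "employees", 10.0, 100.0)]
# Balance edit: the listed parts must sum to the total.
BALANCE_EDITS = [(("wages", "materials"), "total_cost")]


def failed_edits(record):
    """Return labels of the edits that the record violates."""
    failures = []
    for num, den, lower, upper in RATIO_EDITS:
        # Written in linear form (assumes a positive denominator),
        # so no division is needed.
        if not (lower * record[den] <= record[num] <= upper * record[den]):
            failures.append(f"ratio: {num}/{den}")
    for parts, total in BALANCE_EDITS:
        if not isclose(sum(record[p] for p in parts), record[total], rel_tol=1e-9):
            failures.append(f"balance: {total}")
    return failures


clean = {"employees": 50, "wages": 2_000.0, "materials": 3_000.0, "total_cost": 5_000.0}
# Same record with wages misreported by a factor of 1000.
faulty = dict(clean, wages=2_000_000.0)

print(failed_edits(clean))
print(failed_edits(faulty))
```

In this toy case a single misreported value causes both edits to fail, but the failures alone do not say which field is wrong: the balance edit could equally be repaired by changing `materials` or `total_cost`. This is the error localization problem that the abstract's stochastic identification step addresses.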
