Sampling With Synthesis: A New Approach for Releasing Public Use Census Microdata

Many statistical agencies disseminate samples of census microdata, that is, data on individual records, to the public. Before releasing the microdata, agencies typically alter identifying or sensitive values to protect data subjects’ confidentiality, for example by coarsening, perturbing, or swapping data. These standard disclosure limitation techniques distort relationships and distributional features in the original data, especially when applied with high intensity. Furthermore, it can be difficult for analysts of the masked public use data to adjust inferences for the effects of the disclosure limitation. Motivated by these shortcomings, we propose an approach to census microdata dissemination called sampling with synthesis. The basic idea is to replace the identifying or sensitive values in the census with multiple imputations, and release samples from these multiply-imputed populations. We demonstrate that sampling with synthesis can improve the quality of public use data relative to sampling followed by standard statistical disclosure limitation; simulation results showing this are available online as supplemental material. We derive methods for analyzing the multiple datasets generated by sampling with synthesis. We present algorithms for selecting which census values to synthesize based on considerations of disclosure risk and data utility. We illustrate sampling with synthesis on a population constructed with data from the U.S. Current Population Survey.
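The basic idea described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' procedure. The population, the variable names (`age_group`, `income`), the within-group normal synthesis model, and the choice to draw an independent simple random sample from each synthetic population are all hypothetical. A fuller implementation would also draw the model parameters from their posterior distribution rather than plugging in point estimates.

```python
import random
import statistics


def make_toy_population(size, seed=1):
    """Build a hypothetical census: one dict per person (illustrative only)."""
    rng = random.Random(seed)
    centers = {"young": 30000.0, "middle": 55000.0, "older": 45000.0}
    population = []
    for _ in range(size):
        group = rng.choice(sorted(centers))
        population.append(
            {"age_group": group, "income": rng.gauss(centers[group], 8000.0)}
        )
    return population


def synthesize_income(population, rng):
    """Return a copy of the population with 'income' replaced by model draws.

    Assumption: a plug-in normal model fit separately within each age group.
    This ignores parameter uncertainty, which a proper imputation would include.
    """
    incomes_by_group = {}
    for record in population:
        incomes_by_group.setdefault(record["age_group"], []).append(record["income"])
    params = {
        group: (statistics.mean(values), statistics.stdev(values))
        for group, values in incomes_by_group.items()
    }
    synthetic = []
    for record in population:
        mu, sigma = params[record["age_group"]]
        new_record = dict(record)  # leave the confidential record untouched
        new_record["income"] = rng.gauss(mu, sigma)
        synthetic.append(new_record)
    return synthetic


def sampling_with_synthesis(population, m, n, seed=0):
    """Create m synthetic populations and release a sample of size n from each."""
    rng = random.Random(seed)
    released = []
    for _ in range(m):
        synthetic_population = synthesize_income(population, rng)
        # Assumption: an independent simple random sample per synthetic population.
        released.append(rng.sample(synthetic_population, n))
    return released


census = make_toy_population(1000)
public_files = sampling_with_synthesis(census, m=5, n=100)
```

Each element of `public_files` is one released dataset. An analyst would fit the same model to each dataset and combine the results with combining rules appropriate to this design, which is what the inference methods mentioned in the abstract provide.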
