Improved Variance Estimation for Fully Synthetic Datasets

Fully synthetic datasets, i.e., datasets that contain only simulated values, arguably provide a very high level of data protection. Because all values are simulated, re-identification is almost impossible, which makes the approach especially attractive for releasing very sensitive data such as medical records. However, the established variance estimate for fully synthetic datasets has two major drawbacks. First, it can be positively biased, with the bias depending on the sampling rate of the original data. Second, it can become negative. In this paper I illustrate the consequences of these drawbacks for variance estimation and propose an alternative variance estimate that shows less variability, is always unbiased, and can never be negative. This alternative is closely related to the variance estimate for partially synthetic datasets.
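
To make the second drawback concrete, the following minimal sketch assumes that the established estimate is the combining rule commonly used for fully synthetic data, T_f = (1 + 1/m) b_m - u_bar_m, where b_m is the between-dataset variance of the m point estimates and u_bar_m is the average of the within-dataset variance estimates. For comparison it also computes the usual rule for partially synthetic data, T_p = u_bar_m + b_m / m. The function name, the dictionary keys, and the input numbers are illustrative assumptions, and the alternative estimate proposed in the paper is not reproduced here.

```python
from statistics import mean, variance


def combine_synthetic(q, u):
    """Combine results from m synthetic datasets (illustrative sketch).

    q: point estimates, one per synthetic dataset
    u: variance estimates of those point estimates, one per dataset
    """
    m = len(q)
    if m < 2 or len(u) != m:
        raise ValueError("need at least two datasets and matching lengths")

    q_bar = mean(q)        # combined point estimate
    b = variance(q)        # between-dataset variance (divisor m - 1)
    u_bar = mean(u)        # average within-dataset variance

    # Assumed established rule for fully synthetic data: a difference of
    # two terms, so it is negative whenever u_bar > (1 + 1/m) * b.
    t_full = (1 + 1 / m) * b - u_bar

    # Usual rule for partially synthetic data: a sum of non-negative terms.
    t_partial = u_bar + b / m

    return {"q_bar": q_bar, "b": b, "u_bar": u_bar,
            "t_full": t_full, "t_partial": t_partial}


# Made-up inputs chosen so that the within-dataset variance dominates.
result = combine_synthetic([10.0, 10.2, 9.9, 10.1, 10.0], [0.05] * 5)
print(result["t_full"] < 0, result["t_partial"] > 0)
```

With these made-up inputs the subtraction in T_f yields a negative value, whereas T_p, being a sum of non-negative terms, cannot fall below zero.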
