Bayesian Estimation of Disclosure Risks for Multiply Imputed, Synthetic Data

Agencies seeking to disseminate public use microdata, i.e., data on individual records, can replace confidential values with multiple draws from statistical models estimated with the collected data. We present a framework for evaluating disclosure risks inherent in releasing multiply-imputed, synthetic data. The basic idea is to mimic an intruder who computes posterior distributions of confidential values given the released synthetic data and prior knowledge. We illustrate the methodology with artificial fully synthetic data and with partial synthesis of the Survey of Youth in Custody.
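The intruder's calculation can be illustrated with a toy sketch. It is not the paper's implementation, and everything in it is an assumption made for illustration: the confidential variable is binary, the agency synthesizes each released dataset by drawing a success probability from a Beta(1 + sum, 1 + n - sum) posterior and then n synthetic values (so each released count is beta-binomial), and the intruder knows every confidential value except the target's. The names `intruder_posterior`, `known_sum_others`, and `prior_one` are hypothetical.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_betabinom_pmf(k, n, a, b):
    """Log pmf of a beta-binomial count k out of n with parameters a, b."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


def intruder_posterior(synthetic_counts, n, known_sum_others, prior_one=0.5):
    """Posterior over the target's binary value given released synthetic counts.

    Assumptions (illustrative only): the intruder knows the sum of the other
    n - 1 confidential values and the agency's synthesis model, and holds a
    prior probability `prior_one` (strictly between 0 and 1) that the
    target's value is 1.
    """
    log_post = {}
    for y in (0, 1):
        total = known_sum_others + y      # confidential sum if target equals y
        a = 1 + total                     # Beta posterior parameters the
        b = 1 + n - total                 # agency would have used
        log_prior = math.log(prior_one if y == 1 else 1.0 - prior_one)
        log_lik = sum(log_betabinom_pmf(s, n, a, b) for s in synthetic_counts)
        log_post[y] = log_prior + log_lik
    # Normalize on the log scale for numerical stability.
    mx = max(log_post.values())
    weights = {y: math.exp(v - mx) for y, v in log_post.items()}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}


if __name__ == "__main__":
    # Three released synthetic datasets of size 10, summarized by their counts.
    print(intruder_posterior([9, 9, 9], n=10, known_sum_others=4))
```

In this sketch, a posterior that stays close to the prior suggests the synthetic data reveal little about the target, while a posterior concentrated on one value signals higher disclosure risk under the stated assumptions.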
