Differential Correct Attribution Probability for Synthetic Data: An Exploration

Synthetic data generation has been proposed as a flexible alternative to more traditional statistical disclosure control (SDC) methods for limiting disclosure risk. It is functionally distinct from standard SDC methods in that it breaks the link between the data subjects and the data, so that reidentification is no longer meaningful. Orthodox measures of disclosure risk, which are based on reidentification, are therefore not applicable. Research into disclosure risk measures designed specifically for synthetic data has been relatively limited. In this paper, we develop a method called Differential Correct Attribution Probability (DCAP). Using DCAP, we explore the effect of multiple imputation on the disclosure risk of synthetic data.
