A Baseline for Attribute Disclosure Risk in Synthetic Data

The generation of synthetic data is widely considered a viable method for alleviating privacy concerns and for reducing identification and attribute disclosure risk in microdata. The records in a synthetic dataset are artificially created and therefore have no one-to-one correspondence with individuals in the original data. As a result, inferences about those individuals appear to be infeasible while the utility of the data can still be kept at a high level. In this paper, we challenge this belief by interpreting the standard attacker model for attribute disclosure as a classification problem. We show how disclosure risk measures presented in recent publications can be compared to, or even reformulated as, machine learning classification models. Our overall goal is to empirically analyze attribute disclosure risk in synthetic data and to discuss its close relationship to data utility. Moreover, we improve the baseline for attribute disclosure risk from the attacker's perspective by applying variants of the RadiusNearestNeighbor and the EnsembleVote classifier.
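To make the classification view of the attacker model concrete, the following is a minimal sketch and not the paper's implementation. It assumes the attacker knows a target's quasi-identifiers, has access only to the synthetic records, and guesses the sensitive attribute by majority vote among synthetic records within a given radius. The function names, the Hamming distance, and the toy data are assumptions made purely for illustration.

```python
from collections import Counter


def radius_neighbor_attack(synthetic, target_qi, radius):
    """Guess a target's sensitive value from synthetic records.

    synthetic: list of (quasi_identifiers, sensitive_value) pairs
    target_qi: tuple of the target's known quasi-identifier values
    radius:    maximum Hamming distance for a synthetic record to vote
    Returns the majority sensitive value, or None if no record is in range.
    """
    votes = Counter()
    for qi, sensitive in synthetic:
        # Hamming distance: number of quasi-identifiers that differ
        distance = sum(a != b for a, b in zip(qi, target_qi))
        if distance <= radius:
            votes[sensitive] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def attack_accuracy(synthetic, original, radius):
    """Share of original records whose sensitive value is guessed correctly."""
    hits = sum(
        radius_neighbor_attack(synthetic, qi, radius) == sensitive
        for qi, sensitive in original
    )
    return hits / len(original)


# Toy data, invented for illustration: (age band, region) -> diagnosis
synthetic = [
    (("30-39", "north"), "flu"),
    (("30-39", "north"), "flu"),
    (("30-39", "south"), "asthma"),
    (("40-49", "south"), "asthma"),
]
original = [
    (("30-39", "north"), "flu"),
    (("40-49", "south"), "asthma"),
    (("40-49", "north"), "diabetes"),
]

if __name__ == "__main__":
    print(attack_accuracy(synthetic, original, radius=0))
```

In this reading, the attacker's prediction accuracy on the original records acts as an empirical risk figure: the better the synthetic data preserves the relationship between quasi-identifiers and the sensitive attribute, the better such a classifier can perform, which illustrates the link between disclosure risk and utility that the abstract refers to.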
