PeGS: Perturbed Gibbs Samplers that Generate Privacy-Compliant Synthetic Data

This paper proposes an algorithm for synthesizing categorical data with a quantifiable disclosure-risk guarantee. Our algorithm, named Perturbed Gibbs Sampler (PeGS), can handle high-dimensional categorical data that would be intractable if represented as contingency tables. PeGS involves three intuitive steps: 1) disintegration, 2) noise injection, and 3) synthesis. We first disintegrate the original data into statistical building blocks that (approximately) capture the essential characteristics of the original data. This step is implemented efficiently using feature hashing and non-parametric distribution approximation. Next, an optimal amount of noise is injected into the estimated building blocks to guarantee differential privacy or l-diversity. Finally, synthetic samples are drawn via Gibbs sampling. California Patient Discharge data are used to demonstrate the statistical properties of the proposed synthesis methodology. Marginal and conditional distributions, as well as regression coefficients, obtained from the synthesized data are compared to those obtained from the original data. Intruder scenarios are simulated to evaluate the disclosure risks of the synthesized data from multiple angles. Limitations and extensions of the proposed algorithm are also discussed.
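The three steps can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the paper's implementation: it omits the feature-hashing compression of contexts and the non-parametric approximation, uses a simple Laplace mechanism on clipped counts, and all function names, the uniform fallback for unseen contexts, and the tiny example dataset are made up for illustration.

```python
import random
from collections import Counter, defaultdict

def laplace(rng, scale):
    # Laplace(0, scale) noise as the difference of two iid exponentials.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def disintegrate(records, attrs):
    # Step 1 (disintegration): break the data into per-attribute count
    # tables conditioned on the remaining attributes -- the "building
    # blocks".  (The paper compresses contexts via feature hashing; this
    # toy keeps raw tuples.)
    tables = {}
    for a in attrs:
        counts = defaultdict(Counter)
        for r in records:
            ctx = tuple(r[b] for b in attrs if b != a)
            counts[ctx][r[a]] += 1
        tables[a] = counts
    return tables

def perturb(tables, domains, epsilon, rng):
    # Step 2 (noise injection): add Laplace(1/epsilon) noise to every
    # count, clip at zero, and renormalize into conditional distributions.
    noisy = {}
    for a, ctx_counts in tables.items():
        noisy[a] = {}
        for ctx, counts in ctx_counts.items():
            raw = {v: max(counts.get(v, 0) + laplace(rng, 1.0 / epsilon), 0.0)
                   for v in domains[a]}
            total = sum(raw.values())
            if total <= 0.0:  # all counts clipped away: fall back to uniform
                raw = {v: 1.0 for v in domains[a]}
                total = float(len(domains[a]))
            noisy[a][ctx] = {v: c / total for v, c in raw.items()}
    return noisy

def gibbs_synthesize(noisy, attrs, domains, n, sweeps, rng):
    # Step 3 (synthesis): Gibbs sampling -- repeatedly resample each
    # attribute from its perturbed conditional given the current values
    # of the other attributes.
    out = []
    for _ in range(n):
        rec = {a: rng.choice(domains[a]) for a in attrs}
        for _ in range(sweeps):
            for a in attrs:
                ctx = tuple(rec[b] for b in attrs if b != a)
                dist = noisy[a].get(ctx)
                if dist is None:  # unseen context (assumption: uniform backoff)
                    dist = {v: 1.0 / len(domains[a]) for v in domains[a]}
                vals, probs = zip(*dist.items())
                rec[a] = rng.choices(vals, weights=probs)[0]
        out.append(dict(rec))
    return out

# Hypothetical two-attribute example.
rng = random.Random(0)
attrs = ["sex", "age_band"]
domains = {"sex": ["F", "M"], "age_band": ["0-40", "40+"]}
data = [{"sex": s, "age_band": a}
        for s, a in [("F", "0-40"), ("F", "0-40"), ("M", "40+"),
                     ("M", "0-40"), ("F", "40+")]]
tables = disintegrate(data, attrs)
noisy = perturb(tables, domains, epsilon=1.0, rng=rng)
synth = gibbs_synthesize(noisy, attrs, domains, n=3, sweeps=5, rng=rng)
```

Smaller epsilon injects more noise into the building blocks (stronger privacy, lower utility); the trade-off is what the paper's risk-utility evaluation on the discharge data quantifies.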