Differentially Private Data Synthesis Methods

When data are shared among researchers or released for public use, there is a risk of exposing sensitive information about the individuals who contributed to the data. Data synthesis (DS) is a statistical disclosure limitation technique that releases synthetic data sets made up of pseudo individual records. Traditional DS techniques often rely on strong assumptions about a data intruder's behavior and background knowledge to assess disclosure risk. Differential privacy (DP) provides a theoretical framework for strong and robust privacy guarantees in data release without having to model intruders' behavior. In recent years, efforts have been made to incorporate the DP concept into the DS process. In this paper, we examine current DIfferentially Private Data Synthesis (DIPS) techniques, compare them conceptually, and evaluate the statistical utility and inferential properties of the synthetic data generated by each DIPS technique through extensive simulation studies. The comparisons and simulation results shed light on the practical feasibility and utility of the various DIPS approaches and suggest future research directions for DIPS.
