Comparative Study of Differentially Private Data Synthesis Methods

Sharing data among researchers or releasing data for public use carries a risk of exposing sensitive information about individuals in the data set. Data synthesis (DS) is a statistical disclosure limitation technique that releases synthetic data sets composed of pseudo individual records. Traditional DS techniques often rely on strong assumptions about a data intruder's behavior and background knowledge to assess disclosure risk. Differential privacy (DP) offers a theoretical framework that provides a strong and robust privacy guarantee in data release without having to model intruders' behavior. Efforts have been made to incorporate the DP concept into the DS process. In this paper, we examine current DIfferentially Private Data Synthesis (DIPS) techniques for releasing individual-level surrogate data in place of the original data, compare the techniques conceptually, and evaluate the statistical utility and inferential properties of the synthetic data generated by each DIPS technique through extensive simulation studies. Our work sheds light on the practical feasibility and utility of the various DIPS approaches and suggests future research directions for DIPS.
