Practical Data Synthesis for Large Samples

We describe results on the creation and use of synthetic data, derived in the course of a project to make synthetic extracts available to users of the UK Longitudinal Studies. We present a critical review of existing methods of inference from large synthetic data sets. We introduce new variance estimates for large samples of completely synthesised data; these estimates do not require the synthetic data to be generated from the posterior predictive distribution derived from the observed data, and they can be used with a single synthetic data set. Based on these results, we make recommendations on how to synthesise data. We illustrate the practical consequences with an example from the Scottish Longitudinal Study.
