A simplified approach to generating synthetic data for disclosure control

We describe results on the creation and use of synthetic data that were derived in the context of a project to make synthetic extracts available to users of the UK Longitudinal Studies. We present a critical review of existing methods of inference from large synthetic data sets. We introduce new variance estimates for use with large samples of completely synthesised data; these estimates do not require the synthetic data to be generated from the posterior predictive distribution derived from the observed data. Based on these findings, we make recommendations on how to synthesise data. An example of synthesising data from the Scottish Longitudinal Study illustrates our results.
