synthpop: Bespoke Creation of Synthetic Data in R

In many contexts, confidentiality constraints severely restrict access to unique and valuable microdata. Synthetic data that mimic the original observed data and preserve the relationships between variables, but contain no disclosive records, are one possible solution to this problem. The synthpop package for R, introduced in this paper, provides routines to generate synthetic versions of original data sets. We describe the methodology and its consequences for the data characteristics, and illustrate the package features using a survey data example.
