Inferentially Valid, Partially Synthetic Data: Generating from Posterior Predictive Distributions not Necessary

To limit the risk of disclosure when releasing public use data on individual records, statistical agencies and other data disseminators can release multiply-imputed, partially synthetic data (Little, 1993; Reiter, 2003). These comprise the units originally surveyed, with some collected values, e.g., sensitive values at high risk of disclosure or values of quasi-identifiers, replaced with multiple imputations. Partially synthetic data can protect confidentiality, since identifying units and their sensitive data can be difficult when selected values in the released data are not the actual, collected values. With appropriate estimation methods based on the concepts of multiple imputation (Rubin, 1987), they also enable data users to make valid inferences for a variety of estimands using standard, complete-data statistical methods and software. Because of these appealing features, partially synthetic data products have been developed for several major data sources in the U.S., including the Longitudinal Business Database (Kinney et al., 2011), the Survey of Income and Program Participation (Abowd et al., 2006), the American Community Survey group quarters data (Hawala, 2008), and the OnTheMap database of where people live and work (Machanavajjhala et al., 2008). Other examples of partially synthetic data are described in Abowd and Woodcock (2004), Little et al. (2004), Drechsler et al. (2008), and Drechsler and Reiter (2010). In the statistical theory underlying the generation of partially synthetic data, as well as in typical implementations in practice, replacement values are sampled from posterior predictive distributions. That is, the agency repeatedly draws values of the model parameters from their posterior distributions, and generates a set of replacement values based on each parameter draw.
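The two-step procedure just described (draw parameters from their posterior, then generate replacement values from each draw) can be illustrated with a minimal sketch. It is not taken from the article: it assumes a single normally distributed variable, the standard noninformative prior p(mu, sigma^2) proportional to 1/sigma^2, and a hypothetical rule that flags values above 60 for replacement. The function name and all variable names are illustrative only.

```python
import numpy as np


def synthesize_posterior_predictive(y, replace_mask, m, rng):
    """Return m partially synthetic copies of y.

    Assumes y_i ~ N(mu, sigma^2) with prior p(mu, sigma^2) ∝ 1/sigma^2.
    Only the entries flagged in replace_mask are replaced.
    """
    n = y.size
    ybar = y.mean()
    s2 = y.var(ddof=1)
    copies = []
    for _ in range(m):
        # Step 1: draw the parameters from their posterior distribution.
        # sigma^2 | y ~ (n - 1) s^2 / chi-square(n - 1)
        sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)
        # mu | sigma^2, y ~ N(ybar, sigma^2 / n)
        mu = rng.normal(ybar, np.sqrt(sigma2 / n))
        # Step 2: generate replacement values given this parameter draw.
        y_syn = y.copy()
        y_syn[replace_mask] = rng.normal(
            mu, np.sqrt(sigma2), size=int(replace_mask.sum())
        )
        copies.append(y_syn)
    return copies


rng = np.random.default_rng(0)
y = rng.normal(50.0, 10.0, size=200)   # simulated "collected" values
mask = y > 60.0                        # hypothetical high-risk records
copies = synthesize_posterior_predictive(y, mask, m=5, rng=rng)
```

Each of the m released copies keeps the unflagged values unchanged and uses a fresh parameter draw for the flagged ones, which is what distinguishes sampling from the posterior predictive distribution from simulating at a single fixed parameter estimate.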
The motivation for sampling from posterior predictive distributions derives from multiple imputation of missing data, in which drawing the parameters is necessary to enable approximately unbiased variance estimation (Rubin, 1987, Chapter 4). In this article, we argue that it is not necessary to draw parameters to enable valid inferences with partially synthetic data. Instead, data disseminators can compute posterior modes or maximum likelihood estimates of the parameters in the synthesis models, and simulate replacement values after plugging those estimates into the models. Using a simple but informative case, we show mathematically that point and variance estimates based on the plug-in

[1] Jerome P. Reiter. Significance tests for multi-component estimands from multiply imputed, synthetic microdata, 2005.

[2] Jörg Drechsler, et al. Using Support Vector Machines for Generating Synthetic Datasets, 2010, Privacy in Statistical Databases.

[3] Jörg Drechsler, et al. An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets, 2011, Comput. Stat. Data Anal.

[4] John M. Abowd, et al. Final Report to the Social Security Administration on the SIPP/SSA/IRS Public Use File Project, 2006.

[5] Jerome P. Reiter, et al. Towards Unrestricted Public Use Business Microdata: The Synthetic Longitudinal Business Database, 2011.

[6] Jerome P. Reiter, et al. Using CART to generate partially synthetic public use microdata, 2005.

[7] John M. Abowd, et al. Multiply-Imputing Confidential Characteristics and File Links in Longitudinal Linked Data, 2004, Privacy in Statistical Databases.

[8] Jerome P. Reiter, et al. A Comparison of Posterior Simulation and Inference by Combining Rules for Multiple Imputation, 2011.

[9] Jerome P. Reiter, et al. Multiple Imputation for Statistical Disclosure Limitation, 2003.

[10] Jerome P. Reiter, et al. Simultaneous Use of Multiple Imputation for Missing Data and Disclosure Limitation, 2022.

[11] Jerome P. Reiter, et al. Random Forests for Generating Partially Synthetic, Categorical Data, 2010, Trans. Data Priv.

[12] Roger A. Sugden, et al. Multiple Imputation for Nonresponse in Surveys, 1988.

[13] Jerome P. Reiter, et al. Sampling With Synthesis: A New Approach for Releasing Public Use Census Microdata, 2010.

[14] Fang Liu, et al. Statistical Disclosure Techniques Based on Multiple Imputation, 2005.

[15] Jerome P. Reiter, et al. Tests of multivariate hypotheses when using multiple imputation for missing data and disclosure limitation, 2010.

[16] Jörg Drechsler, et al. Comparing Fully and Partially Synthetic Datasets for Statistical Disclosure Control in the German IAB Establishment Panel, 2008, Trans. Data Priv.

[17] Ashwin Machanavajjhala, et al. Privacy: Theory meets Practice on the Map, 2008, 2008 IEEE 24th International Conference on Data Engineering.