An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets

When intense redaction is needed to protect the confidentiality of data subjects' identities and sensitive attributes, statistical agencies can release synthetic data. To create synthetic data, the agency replaces identifying or sensitive values with draws from statistical models estimated on the confidential data. Many agencies are reluctant to adopt this approach because (i) the quality of the generated data depends strongly on the quality of the underlying models, and (ii) developing effective synthesis models can be labor-intensive and difficult. Recently, it has been suggested that agencies use nonparametric methods from the machine learning literature to generate synthetic data. These methods can capture non-linear relationships that might otherwise be missed and can be run with minimal tuning, considerably reducing the burden on the agency. Four synthesizers based on machine learning algorithms (classification and regression trees, bagging, random forests, and support vector machines) are evaluated in terms of their potential to preserve analytical validity while reducing disclosure risks. The evaluation is based on a repeated sampling simulation with a subset of the 2002 Uganda census public use sample data. The simulation suggests that synthesizers based on regression trees can produce synthetic datasets that provide reliable estimates and low disclosure risks, and that statistical agencies can implement these synthesizers easily.
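To make the replacement step concrete, the following is a minimal sketch of a regression-tree synthesizer for a single confidential numeric variable. It is not the implementation evaluated in the paper. It assumes one plausible reading of the tree-based approach: grow a tree that predicts the confidential variable from the other variables, then replace each record's value with a random draw from the observed values in that record's leaf. The function names (`grow_tree`, `leaf_values`, `synthesize`), the `min_leaf` parameter, and the toy data are all hypothetical, and refinements such as pruning or smoothing the leaf distributions are omitted.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def grow_tree(rows, y, min_leaf=5):
    """Grow a small regression tree; each leaf stores its observed y values."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[j] <= t]
            right = [i for i, r in enumerate(rows) if r[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue  # split would create a leaf that is too small
            score = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    # Stop when no admissible split reduces the within-node variation.
    if best is None or best[0] >= sse(y):
        return {"leaf": list(y)}
    _, j, t, left, right = best
    return {
        "feature": j,
        "threshold": t,
        "left": grow_tree([rows[i] for i in left], [y[i] for i in left], min_leaf),
        "right": grow_tree([rows[i] for i in right], [y[i] for i in right], min_leaf),
    }


def leaf_values(tree, row):
    """Return the observed y values in the leaf that `row` falls into."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]


def synthesize(rows, y, min_leaf=5, seed=0):
    """Replace each y with a random draw from the values in its record's leaf."""
    rng = random.Random(seed)
    tree = grow_tree(rows, y, min_leaf)
    return [rng.choice(leaf_values(tree, r)) for r in rows]


if __name__ == "__main__":
    # Toy data (made up): one predictor and a confidential outcome.
    predictors = [(age,) for age in range(20, 60, 2)]
    income = [1000 + 50 * k if k < 10 else 4000 + 50 * k for k in range(20)]
    print(synthesize(predictors, income, min_leaf=5, seed=1))
```

In this sketch, `min_leaf` controls an obvious trade-off: larger leaves pool more donors, which may lower the chance that a record receives its own original value but can blur relationships among variables. Because each released value is an observed value from some record in the same leaf, a draw-from-leaf scheme like this one does not by itself guarantee protection, which is why disclosure risk still has to be assessed separately.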
