On Regression-Tree-Based Synthetic Data Methods for Business Data

This paper concerns the use of synthetic data for protecting the confidentiality of business data during statistical analysis. Synthetic data sets are traditionally constructed by replacing sensitive values in a confidential data set with draws from statistical models estimated on that data set. Unfortunately, building effective statistical models can be a difficult and labour-intensive task. Recently, it has been proposed to use easily implemented methods from machine learning in place of statistical model estimation in the data synthesis task. J. Drechsler and J.P. Reiter (2011) evaluated four such methods and found that regression trees could yield synthetic data sets that provide reliable analysis results with low disclosure risk. Their conclusion was based on simulations using a subset of the 2002 Uganda census public use file. It is an open question whether the same conclusion applies to data with different characteristics, for example business data, which differ markedly from population census and survey data. In particular, business data generally have few variables that are mostly categorical, and often have highly skewed distributions with outliers. In this paper we investigate the applicability of regression-tree-based methods for constructing synthetic business data. We give a detailed example comparing exploratory data analysis and linear regression results under two variants of a regression-tree-based synthetic data approach, and we evaluate these results against the corresponding analyses of the original data. We further investigate the impact of different stopping criteria on performance.
While any method designed to protect confidentiality introduces error, and may indeed lead to misleading conclusions, our analysis of synthesisers based on CART models provides some evidence that this error is not random but is due to the particular characteristics of business data. We conclude that these methods need to be applied with more careful analysis, and that end users need to be aware of possible discrepancies.
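To make the idea of a regression-tree-based synthesiser concrete, the following is a minimal sketch and not the procedure evaluated in the paper. It assumes a single numeric predictor, a least-squares split rule, a minimum leaf size as the only stopping criterion, and plain sampling with replacement from the values in each leaf. The function names (`grow_tree`, `leaf_values`, `synthesise`) and the toy data are hypothetical and chosen only for illustration.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def grow_tree(rows, min_leaf):
    """Grow a regression tree on (x, y) pairs.

    A node is split at the cut point that minimises the total within-child
    sum of squares, provided both children keep at least `min_leaf` records.
    """
    best = None
    if len(rows) >= 2 * min_leaf:
        xs = sorted({x for x, _ in rows})
        for lo, hi in zip(xs, xs[1:]):
            cut = (lo + hi) / 2
            left = [r for r in rows if r[0] <= cut]
            right = [r for r in rows if r[0] > cut]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse([y for _, y in left]) + sse([y for _, y in right])
            if best is None or score < best[0]:
                best = (score, cut, left, right)
    if best is None:
        # No admissible split: the leaf stores its observed sensitive values.
        return {"leaf": [y for _, y in rows]}
    _, cut, left, right = best
    return {
        "cut": cut,
        "left": grow_tree(left, min_leaf),
        "right": grow_tree(right, min_leaf),
    }


def leaf_values(tree, x):
    """Return the observed y values in the leaf that x falls into."""
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["cut"] else tree["right"]
    return tree["leaf"]


def synthesise(rows, min_leaf, rng):
    """Keep each x and replace y by a random draw from the record's leaf."""
    tree = grow_tree(rows, min_leaf)
    return [(x, rng.choice(leaf_values(tree, x))) for x, _ in rows]


# Hypothetical skewed "business" data: (employees, turnover), one large outlier.
rows = [(1, 10.0), (2, 12.0), (3, 11.0), (50, 900.0), (60, 950.0), (400, 20000.0)]

synthetic = synthesise(rows, min_leaf=2, rng=random.Random(0))
for original, released in zip(rows, synthetic):
    print(original, "->", released)
```

Under these assumptions the stopping criterion directly controls the trade-off between utility and risk. With `min_leaf=1` and distinct predictor values, every leaf holds a single record and the "synthetic" data reproduce the original exactly, outlier included. Larger leaves mix an outlier with its neighbours, which protects it but can distort analyses of skewed variables.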

[1] R. Chambers, et al. Estimating distribution functions from survey data, 1986.

[2] Koen De Backer, et al. An OECD perspective on microdata access: Trends, opportunities and challenges, 2009.

[3] Natalie Shlomo, et al. Comparison of Remote Analysis with Statistical Disclosure Control for Protecting the Confidentiality of Business Data, 2012, Trans. Data Priv.

[4] Jerome P. Reiter, et al. Model Diagnostics for Remote Access Regression Servers, 2003, Stat. Comput.

[5] Damien McAullay, et al. Remote access methods for exploratory data analysis and statistical modelling: Privacy-Preserving Analytics®, 2008, Comput. Methods Programs Biomed.

[6] Natalie Shlomo, et al. Protection of micro-data subject to edit constraints against Statistical Disclosure, 2008.

[7] Wei-Yin Loh, et al. Classification and regression trees, 2011, WIREs Data Mining Knowl. Discov.

[8] D. Rubin. The Bayesian Bootstrap, 1981.

[9] C. Gini. Variabilita e Mutabilita, 1913.

[10] Jerome P. Reiter, et al. Multiple Imputation for Statistical Disclosure Limitation, 2003.

[11] Giuseppe Porro, et al. Missing data imputation, matching and other applications of random recursive partitioning, 2007, Comput. Stat. Data Anal.

[12] Jerome P. Reiter, et al. Data Dissemination and Disclosure Limitation in a World Without Microdata: A Risk-Utility Framework for Remote Access Analysis Servers, 2005.

[13] Jörg Drechsler, et al. An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets, 2011, Comput. Stat. Data Anal.

[14] John Van Hoewyk, et al. A multivariate technique for multiply imputing missing values using a sequence of regression models, 2001.

[15] George T. Duncan, et al. Disclosure Risk vs. Data Utility: The R-U Confidentiality Map, 2003.

[16] Christine M. O'Keefe, et al. Confidentialising Exploratory Data Analysis Output in Remote Analysis, 2012.

[17] Jerome P. Reiter, et al. Multiple imputation for missing data via sequential regression trees, 2010, American Journal of Epidemiology.

[18] Jerome P. Reiter, et al. Using CART to generate partially synthetic public use microdata, 2005.