Comparison of Remote Analysis with Statistical Disclosure Control for Protecting the Confidentiality of Business Data

This paper addresses the challenge of allowing statistical analysis of confidential business data while maintaining confidentiality. The most widely used approach to date is statistical disclosure control, which involves modifying or confidentialising data before releasing it to users. Newer proposed approaches include the release of multiply imputed synthetic data in place of the original data, and the use of a remote analysis system that enables users to submit statistical queries and receive output without direct access to the data. Most implementations of statistical disclosure control methods to date involve census or survey microdata on individual persons, because existing methods are generally acknowledged to provide inadequate confidentiality protection for business (or enterprise) data. In this paper we compare the statistical disclosure control approach with the remote analysis approach in the context of protecting the confidentiality of business data in statistical analysis. We provide an example that enables a side-by-side comparison of the outputs of exploratory data analysis and linear regression analysis conducted on a sample business dataset under the two approaches, with traditional unconfidentialised results as a standard for comparison. The remote analysis approach has both advantages and disadvantages, and it is unlikely to replace statistical disclosure control methods in all applications. If the disadvantages are judged too serious in a given situation, the analyst may have to seek access to the unconfidentialised dataset. However, our example supports the conclusion that the advantages may outweigh the disadvantages in some cases, including for some analyses of unconfidentialised business data, provided the analyst is aware of the output confidentialisation methods and their potential impact.
