Remote access methods for exploratory data analysis and statistical modelling: Privacy-Preserving Analytics®

This paper addresses the challenge of enabling the use of confidential or private data for research and policy analysis while protecting confidentiality and privacy by reducing the risk of disclosure of sensitive information. Traditional solutions to the problem of reducing disclosure risk include releasing de-identified data and modifying data before release. In this paper we discuss the alternative approach of using a remote analysis server, which does not release any data but is instead designed to deliver useful results of user-specified statistical analyses with a low risk of disclosure. The techniques described in this paper enable a user to apply a wide range of methods in exploratory data analysis, regression and survival analysis, while at the same time reducing the risk that the user can read or infer any individual record attribute value. We illustrate our methods with examples from biostatistics using publicly available data. We have implemented our techniques in a software demonstrator called Privacy-Preserving Analytics (PPA), via a web-based interface to the R software. We believe that, in some situations, PPA may provide an effective balance between the competing goals of providing useful information and reducing disclosure risk.
