Perturbed robust linear estimating equations for confidentiality protection in remote analysis

National statistical agencies and other data custodians collect and hold vast amounts of survey and census data containing information vital for research and policy analysis. However, allowing analysis of these data while protecting respondent confidentiality has proved challenging. In this paper we focus on the remote analysis approach, under which a confidential dataset is held in a secure environment under the direct control of the data custodian agency. A computer system within the secure environment accepts a query from an analyst, runs it on the data, and returns the results to the analyst. The analyst has no direct access to the data and cannot view any microdata records. We further focus on fitting linear regression models to confidential data in the presence of outliers and influential points, such as are often present in business data. We propose a new method for protecting confidentiality in linear regression via a remote analysis system that provides additional confidentiality protection for outliers and influential points in the data. The method described in this paper was designed for the prototype DataAnalyser system developed by the Australian Bureau of Statistics; however, it would also be suitable for similar remote analysis systems.
