Robust Statistics Meets SDC: New Disclosure Risk Measures for Continuous Microdata Masking

This study evaluates the re-identification risk captured by distance-based disclosure risk measures for numerical variables. First, we give an overview of previously proposed disclosure risk measures. None of these measures accounts for outliers, although we assume that outliers must be protected more strongly than observations near the center of the data cloud. We therefore propose a weighting scheme for each observation based on robust Mahalanobis distances. We also consider the peculiarities of different protection methods and adapt our measures so that they give realistic results for each method. To test the proposed distance-based disclosure risk measures, we run a simulation study with different amounts of data contamination. The results of the simulation study show the usefulness of the proposed measures and give deeper insight into how the disclosure risk of quantitative data can be measured successfully. All proposed methods, as well as all protection methods and measures used in this paper, are implemented in the R package sdcMicro, which is freely available on the Comprehensive R Archive Network (http://cran.r-project.org).
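The idea of weighting a distance-based risk measure by robust distances can be illustrated with a small sketch. It is not the sdcMicro implementation and not the exact measure proposed in the paper. It rests on several assumptions made here for illustration only: a diagonal robust scatter matrix built from the coordinate-wise median and MAD stands in for a full robust covariance estimate (such as the MCD); a record counts as unsafe when every masked value lies within a relative tolerance `tol` of its original value; and the weight `max(1, d / cutoff)` is a hypothetical choice. The function names and the toy data are likewise invented.

```python
import math
import statistics


def robust_distances(data):
    """Robust Mahalanobis-type distance of each row, using the coordinate-wise
    median as center and a diagonal scatter matrix of squared MADs.
    Assumption: a simplified stand-in for MCD-based robust distances.
    Requires every column to have a nonzero MAD."""
    cols = list(zip(*data))
    med = [statistics.median(c) for c in cols]
    # 1.4826 makes the MAD consistent with the standard deviation under normality
    mad = [1.4826 * statistics.median([abs(v - m) for v in c])
           for c, m in zip(cols, med)]
    return [math.sqrt(sum(((v - m) / s) ** 2 for v, m, s in zip(row, med, mad)))
            for row in data]


def weighted_risk(original, masked, tol=0.1):
    """Share of unsafe records, with outlying records weighted more heavily.
    Assumption: the weighting rule is hypothetical, chosen for illustration."""
    rd = robust_distances(original)
    # Square root of the 0.975 chi-square quantile; this closed form
    # holds only for 2 degrees of freedom (two variables).
    cutoff = math.sqrt(-2.0 * math.log(0.025))
    weights = [max(1.0, d / cutoff) for d in rd]
    unsafe = [all(abs(m - o) <= tol * abs(o) for o, m in zip(orow, mrow))
              for orow, mrow in zip(original, masked)]
    return sum(w for w, u in zip(weights, unsafe) if u) / sum(weights)


# Toy data with two variables; the last record is a clear outlier.
original = [(10, 20), (11, 19), (9, 21), (10, 22), (12, 20), (50, 90)]
masked = [(12, 23), (11.2, 19.1), (7, 24), (13, 25), (12.1, 20.2), (50.5, 90.5)]

unweighted = sum(
    all(abs(m - o) <= 0.1 * abs(o) for o, m in zip(orow, mrow))
    for orow, mrow in zip(original, masked)
) / len(original)

print("unweighted risk:", unweighted)
print("weighted risk:  ", weighted_risk(original, masked))
```

In this toy data the outlying record is barely perturbed by the masking, so the weighted measure reports a higher risk than the plain share of unsafe records. That reflects the premise above that outliers need stronger protection.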
