More Hybrid and Secure Protection of Statistical Data Sets

Different methods and paradigms to protect data sets containing sensitive statistical information have been proposed and studied. The idea is to publish a perturbed version of the data set that does not leak confidential information, but that still allows users to obtain meaningful statistical values about the original data. The two main paradigms for data set protection are the classical one and the synthetic one. Recently, the possibility of combining the two paradigms, leading to a hybrid paradigm, has been considered. In this work, we first analyze the security of some synthetic and (partially) hybrid methods that have been proposed in the last years, and we conclude that they suffer from a high interval disclosure risk. We then propose the first fully hybrid SDC methods; unfortunately, they also suffer from a quite high interval disclosure risk. To mitigate this, we propose a postprocessing technique that can be applied to any data set protected with a synthetic method, with the goal of reducing its interval disclosure risk. We describe through the paper a set of experiments performed on reference data sets that support our claims.

[1]  V. Torra,et al.  Disclosure control methods and information loss for microdata , 2001 .

[2]  Javier Herranz,et al.  Rethinking rank swapping to decrease disclosure risk , 2008, Data Knowl. Eng..

[3]  Cynthia Dwork,et al.  Differential Privacy , 2006, ICALP.

[4]  P. Doyle,et al.  Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies , 2001 .

[5]  Philip S. Yu,et al.  Differentially private data release for data mining , 2011, KDD.

[6]  Latanya Sweeney,et al.  Achieving k-Anonymity Privacy Protection Using Generalization and Suppression , 2002, Int. J. Uncertain. Fuzziness Knowl. Based Syst..

[7]  Rathindra Sarathy,et al.  Generating Sufficiency-based Non-Synthetic Perturbed Data , 2008, Trans. Data Priv..

[8]  Yehuda Lindell,et al.  Privacy Preserving Data Mining , 2002, Journal of Cryptology.

[9]  Josep Domingo-Ferrer,et al.  Practical Data-Oriented Microaggregation for Statistical Disclosure Control , 2002, IEEE Trans. Knowl. Data Eng..

[10]  ASHWIN MACHANAVAJJHALA,et al.  L-diversity: privacy beyond k-anonymity , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[11]  Josep Domingo-Ferrer,et al.  On the complexity of optimal microaggregation for statistical disclosure control , 2001 .

[12]  Ninghui Li,et al.  t-Closeness: Privacy Beyond k-Anonymity and l-Diversity , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[13]  William E. Winkler,et al.  Matching and record linkage , 2011 .

[14]  Bing-Rong Lin,et al.  Towards an axiomatization of statistical privacy and utility , 2010, PODS.

[15]  Ashwin Machanavajjhala,et al.  Data Publishing against Realistic Adversaries , 2009, Proc. VLDB Endow..

[16]  Traian Marius Truta,et al.  Protection : p-Sensitive k-Anonymity Property , 2006 .

[17]  Jay-J. Kim A METHOD FOR LIMITING DISCLOSURE IN MICRODATA BASED ON RANDOM NOISE AND , 2002 .

[18]  Josep Domingo-Ferrer,et al.  Hybrid microdata using microaggregation , 2010, Inf. Sci..

[19]  Jim Burridge,et al.  Information preserving statistical obfuscation , 2003, Stat. Comput..

[20]  Latanya Sweeney,et al.  k-Anonymity: A Model for Protecting Privacy , 2002, Int. J. Uncertain. Fuzziness Knowl. Based Syst..