Empirical privacy and empirical utility of anonymized data

Procedures to anonymize data sets are vital for companies, government agencies, and other bodies that must share data without compromising the privacy of the individuals who contribute to it. Despite much work on this topic, the area has not yet reached stability. Early models (k-anonymity and ℓ-diversity) are now thought to offer insufficient privacy, while noise-based methods such as differential privacy are seen as providing stronger privacy but less utility. However, under all of these methods, the sensitive information of some individuals can often be inferred with relatively high accuracy. In this paper, we reverse the idea of a "privacy attack" by incorporating it into a measure of privacy. Specifically, we advocate the notion of empirical privacy, based on the posterior beliefs of an adversary and the adversary's ability to draw inferences about sensitive values in the data. This is not a new model but a unifying view: it allows us to study several well-known privacy models that are otherwise not directly comparable. We also consider an empirical approach to measuring utility, based on a workload of queries. Consequently, we are able to place different privacy models, including differential privacy and the earlier syntactic models, on the same scale and compare their privacy/utility tradeoffs. We find that, in practice, the difference between differential privacy and various syntactic models is less dramatic than previously thought, although clear domination relations still hold between them.
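To make the two empirical measures concrete, here is a minimal Python sketch under our own illustrative assumptions (the function names, the plurality-guess adversary, and the relative-error scoring are ours, not the paper's definitions). Empirical privacy is scored as the accuracy of an adversary's guesses of each record's sensitive value from the published grouping; empirical utility is scored as the average relative error of a workload of count queries answered on the release versus the original data.

```python
from collections import Counter

def empirical_privacy(groups, sensitive):
    """Empirical privacy via a simple adversary: for every record, guess
    the most frequent sensitive value within the record's published group
    (equivalence class), and report the fraction of correct guesses.
    Higher adversary accuracy means weaker empirical privacy."""
    correct = total = 0
    for members in groups.values():
        # The adversary's best single guess for everyone in this group:
        # a crude stand-in for the paper's posterior-belief adversary.
        guess = Counter(sensitive[i] for i in members).most_common(1)[0][0]
        correct += sum(sensitive[i] == guess for i in members)
        total += len(members)
    return correct / total

def empirical_utility(workload, original, release):
    """Empirical utility via a query workload: average relative error of
    count queries answered on the released data versus the original.
    Lower error means higher empirical utility."""
    errors = []
    for predicate in workload:
        true_count = sum(map(predicate, original))
        est_count = sum(map(predicate, release))
        errors.append(abs(true_count - est_count) / max(true_count, 1))
    return sum(errors) / len(errors)

# Toy example: six records with a sensitive attribute, published as two
# 3-anonymous groups.
sensitive = ["flu", "flu", "cold", "flu", "cold", "cold"]
groups = {"g1": [0, 1, 2], "g2": [3, 4, 5]}
print(empirical_privacy(groups, sensitive))   # 0.666... (adversary accuracy)

# Toy utility workload: range-count queries against a noisy release.
original = list(range(10))
release = [x + 1 for x in original]           # crude stand-in for noisy data
workload = [lambda r, t=t: r < t for t in (3, 5, 8)]
print(empirical_utility(workload, original, release))  # average relative error
```

The point of the sketch is the structure, not the specific attack: any attack's accuracy can serve as the empirical privacy score, and any workload's error as the empirical utility score, which is what places otherwise incomparable models on a common scale.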
