Improving data utility in differential privacy and k-anonymity

We focus on two mainstream privacy models: k-anonymity and differential privacy. Once a privacy model has been selected, the goal is to enforce it while preserving as much data utility as possible. The main objective of this thesis is to improve data utility in k-anonymous and differentially private data releases. k-Anonymity has several drawbacks. On the disclosure limitation side, it lacks protection against attribute disclosure and against informed intruders. On the data utility side, dealing with a large number of quasi-identifier attributes is problematic. We propose a relaxation of k-anonymity that addresses these issues. Differential privacy limits disclosure risk through noise addition, and the Laplace distribution is commonly used for the random noise. We show that the Laplace distribution is not optimal: the same disclosure limitation guarantee can be attained by adding less noise. Optimal univariate and multivariate noise distributions are characterized and constructed. Common mechanisms to attain differential privacy do not take into account the user's prior knowledge; they implicitly assume zero initial knowledge about the query response. We propose a mechanism that focuses on limiting the knowledge gain over the prior knowledge. Microaggregation-based k-anonymity and differential privacy can be combined to produce microdata releases with the strong privacy guarantees of differential privacy and improved data accuracy. The last contribution examines the relation between t-closeness and differential privacy. We show that, for a specific distance and under some reasonable assumptions on the intruder's knowledge, t-closeness leads to differential privacy.
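As a point of reference for the two models discussed above, the following is a minimal Python sketch of their textbook baselines, not of the methods proposed in this thesis. It assumes records are plain dictionaries; the function names `is_k_anonymous` and `laplace_mechanism` and all parameter names are hypothetical and chosen for illustration. The first function checks the standard k-anonymity condition (every combination of quasi-identifier values is shared by at least k records). The second applies the standard Laplace mechanism, whose noise scale is the query sensitivity divided by epsilon; this is the baseline noise that the thesis argues is not optimal.

```python
import random
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k records (the standard k-anonymity condition)."""
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())


def laplace_mechanism(true_answer, sensitivity, epsilon, rng=random):
    """Baseline Laplace mechanism: add Laplace(0, sensitivity / epsilon) noise.

    The noise is sampled as the difference of two independent exponential
    variates with mean `scale`, which follows a Laplace(0, scale) distribution.
    """
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    scale = sensitivity / epsilon
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_answer + noise


if __name__ == "__main__":
    # Toy, made-up records: "age" and "zip" act as quasi-identifiers.
    toy = [
        {"age": "30-39", "zip": "430**", "disease": "flu"},
        {"age": "30-39", "zip": "430**", "disease": "asthma"},
        {"age": "40-49", "zip": "431**", "disease": "flu"},
        {"age": "40-49", "zip": "431**", "disease": "diabetes"},
    ]
    print(is_k_anonymous(toy, ["age", "zip"], k=2))
    # A counting query has sensitivity 1; epsilon = 0.5 is an arbitrary choice.
    print(laplace_mechanism(true_answer=2, sensitivity=1.0, epsilon=0.5))
```

In this baseline, smaller values of epsilon give a stronger guarantee but a larger noise scale, which is the utility cost that motivates looking for noise distributions with less spread under the same guarantee.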
