Database Anonymization: Privacy Models, Data Utility, and Microaggregation-based Inter-model Connections

The current social and economic context increasingly demands open data to improve scientific research and decision making. However, when published data refer to individual respondents, disclosure risk limitation techniques must be applied to anonymize the data and guarantee, by design, the fundamental right to privacy of the subjects the data refer to. Disclosure risk limitation has a long record in the statistical and computer science research communities, which have developed a variety of privacy-preserving solutions for data releases. This Synthesis Lecture provides a comprehensive overview of the fundamentals of privacy in data releases, focusing on the computer science perspective. Specifically, we detail the privacy models, anonymization methods, and utility and risk metrics that have been proposed so far in the literature. In addition, as a more advanced topic, we identify and discuss in detail the connections between several privacy models (i.e., how to accumulate the privacy guarantees they offer to achieve more robust protection, and when such guarantees are equivalent or complementary). We also explore the links between anonymization methods and privacy models (i.e., how anonymization methods can be used to enforce privacy models and thereby offer ex ante privacy guarantees). These latter topics are relevant to researchers and advanced practitioners, who will gain a deeper understanding of the available data anonymization solutions and the privacy guarantees they can offer.
