Toward sensitive document release with privacy guarantees

Privacy has become a serious concern for modern information societies. Much of the data exchanged with or released to untrusted parties every day is sensitive, which requires responsible organizations to undertake appropriate privacy protection measures. Nowadays, much of this data is text (e.g., emails, messages posted on social media, healthcare outcomes) that, because of its unstructured and semantic nature, poses a challenge for automatic data protection methods. In fact, textual documents are usually protected manually, in a process known as document redaction or sanitization. To do so, human experts identify sensitive terms (i.e., terms that may reveal identities and/or confidential information) and protect them accordingly (e.g., via removal or, preferably, generalization). To relieve experts of this burdensome task, in previous work we introduced the theoretical basis of C-sanitization, an inherently semantic privacy model that provides a basis for the development of automatic document redaction/sanitization algorithms and offers clear, a priori privacy guarantees on data protection. Despite its potential benefits, C-sanitization still presents some limitations when applied in practice (mainly regarding flexibility, efficiency and accuracy). In this paper, we propose a new, more flexible model, named (C, g(C))-sanitization, which enables an intuitive configuration of the trade-off between the desired level of protection (i.e., controlled information disclosure) and the preservation of the utility of the protected data (i.e., amount of semantics to be preserved). We also present a set of technical solutions and algorithms that provide an efficient and scalable implementation of the model and improve its practical accuracy, as we illustrate through empirical experiments.
Highlights

Theory and implementation of (C, g(C))-sanitization, a semantic privacy model for document sanitization.
An intuitive mechanism to configure the trade-off between privacy and utility.
A heuristic and scalable algorithm implementing (C, g(C))-sanitization.
An accurate assessment of disclosure risks based on the Web's information distribution.
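
The idea of assessing disclosure risk from the Web's information distribution can be illustrated with a minimal sketch. It rests on assumptions that are not stated in the text above: probabilities are estimated from hit counts (for example, search engine page counts), information content is IC(x) = -log2 p(x), the disclosure that a term t causes about a sensitive entity c is measured by pointwise mutual information, and a term is treated as risky when PMI(c; t) exceeds IC(g(c)), the information content of the generalization chosen for c. The function names and the counts are hypothetical and only serve the illustration.

```python
import math


def information_content(count, total):
    """IC(x) = -log2 p(x), with p(x) estimated as count / total."""
    return -math.log2(count / total)


def pmi(count_c, count_t, count_ct, total):
    """PMI(c; t) = log2( p(c, t) / (p(c) * p(t)) ), estimated from counts."""
    if count_ct == 0:
        # The two never co-occur: t reveals nothing about c.
        return float("-inf")
    p_ct = count_ct / total
    return math.log2(p_ct / ((count_c / total) * (count_t / total)))


def risky_terms(term_counts, count_c, count_g, total):
    """Return the terms whose PMI with c exceeds IC(g(c)).

    term_counts maps each term to (count of term, co-occurrence count with c).
    The threshold rule is an assumption made for this sketch.
    """
    threshold = information_content(count_g, total)
    return [
        term
        for term, (count_t, count_ct) in term_counts.items()
        if pmi(count_c, count_t, count_ct, total) > threshold
    ]


# Hypothetical counts, chosen as powers of two so the logs are exact.
TOTAL = 2 ** 20
COUNT_C = 2 ** 10   # sensitive entity c, IC(c) = 10 bits
COUNT_G = 2 ** 14   # generalization g(c), IC(g(c)) = 6 bits
TERMS = {
    "a": (2 ** 8, 2 ** 7),    # PMI(c; a) = 9 bits, above the threshold
    "b": (2 ** 16, 2 ** 8),   # PMI(c; b) = 2 bits, below the threshold
}

flagged = risky_terms(TERMS, COUNT_C, COUNT_G, TOTAL)
```

Under this reading, choosing a more general g(c) lowers the threshold and flags more terms (more protection, less utility), while a more specific g(c) raises it, which is one way to picture the configurable trade-off between privacy and utility.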
