Privacy protection of textual attributes through a semantic-based masking method

Using microdata provided by statistical agencies has many benefits from the data mining point of view. However, such data often involve sensitive information that can be directly or indirectly related to individuals. An appropriate anonymisation process is needed to minimise the risk of disclosure. Several masking methods have been developed to deal with continuous-scale numerical data or bounded textual values, but approaches to the anonymisation of unbounded textual values are scarce and shallow. Because of the importance of textual data in the Information Society, in this paper we present a new masking method for anonymising unbounded textual values, based on the fusion of records with similar values to form groups of indistinguishable individuals. Since, from the data exploitation point of view, the utility of textual information is closely related to the preservation of its meaning, our method relies on the structured knowledge representation given by ontologies. This domain knowledge is used to guide the masking process towards the merging that best preserves the semantics of the original data. Because textual data typically consist of large and heterogeneous value sets, our method provides a computationally efficient algorithm by relying on several heuristics rather than exhaustive searches. The method is evaluated with real data in a concrete data mining application that involves solving a clustering problem. We also compare the method with more classical approaches that focus on optimising the value distribution of the dataset. The results show that a semantically grounded anonymisation best preserves the utility of data in both the theoretical and the practical setting, and reduces the probability of record linkage. At the same time, it scales well with the size of the input data.
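To make the idea of fusing records with semantically similar values more concrete, the following is a minimal sketch, not the algorithm of the paper. It assumes a hypothetical toy taxonomy (`PARENT`), a plain edge-counting distance, and a simple greedy heuristic that repeatedly merges the rarest value into its semantically nearest neighbour until every value is shared by at least k records. The names `PARENT`, `path_to_root`, `distance` and `mask` are illustrative assumptions, as are the similarity measure and the tie-breaking rules.

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> parent). A real setting would rely on
# an ontology; this dictionary is an assumption made for illustration only.
PARENT = {
    "hiking": "outdoor_sport",
    "climbing": "outdoor_sport",
    "swimming": "water_sport",
    "sailing": "water_sport",
    "outdoor_sport": "sport",
    "water_sport": "sport",
    "sport": "activity",
    "museum": "culture",
    "culture": "activity",
}


def path_to_root(concept):
    """Return the list of concepts from `concept` up to the taxonomy root."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def distance(a, b):
    """Edge-counting distance through the deepest common ancestor.

    This measure is an assumption of the sketch; the paper may use another.
    """
    path_a, path_b = path_to_root(a), path_to_root(b)
    depth_in_b = {concept: i for i, concept in enumerate(path_b)}
    for i, concept in enumerate(path_a):
        if concept in depth_in_b:
            return i + depth_in_b[concept]
    raise ValueError("concepts share no common ancestor")


def mask(values, k):
    """Greedily merge rare values until each value occurs at least k times."""
    if len(values) < k:
        raise ValueError("need at least k records")
    masked = list(values)
    while True:
        counts = Counter(masked)
        rare = [v for v, n in counts.items() if n < k]
        if not rare:
            return masked
        # Pick the rarest value (ties broken alphabetically).
        source = min(rare, key=lambda v: (counts[v], v))
        others = [v for v in counts if v != source]
        # Merge it into the nearest value, preferring larger groups on ties.
        target = min(others, key=lambda v: (distance(source, v), -counts[v], v))
        masked = [target if v == source else v for v in masked]


records = ["hiking", "hiking", "climbing", "swimming",
           "sailing", "sailing", "museum"]
masked_records = mask(records, k=2)
print(masked_records)
```

In this sketch "climbing" is replaced by "hiking" and "swimming" by "sailing", since those pairs are siblings in the toy taxonomy, whereas a purely distributional criterion would have no reason to prefer them over any other merge. Each pass removes one distinct value, so the loop terminates after at most as many passes as there are distinct values.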
