Semantic adaptive microaggregation of categorical microdata

In the context of Statistical Disclosure Control, microaggregation is a privacy-preserving method that masks sensitive microdata prior to publication. It iteratively creates clusters of at least k elements and replaces the elements of each cluster by the cluster prototype, so that they become k-indistinguishable (anonymous). This transformation causes a loss of information with respect to the original dataset, which reduces the utility of the masked data; the aim of microaggregation algorithms is therefore to find the partition that minimises the information loss while ensuring a given level of privacy. Most microaggregation methods, such as the MDAV algorithm, which is the focus of this paper, have been designed for numerical data. Extending them to support non-numerical (categorical) attributes is not straightforward because appropriate aggregation operators are difficult to define for such data. Concretely, related works based on the MDAV algorithm group data into clusters of constrained (or even fixed) size and/or apply only a basic categorical treatment to non-numerical data. This negatively affects the utility of the protected dataset, because neither the distributional characteristics of the data nor their underlying semantics are properly considered. In this paper, we propose a set of modifications to the MDAV algorithm focused on categorical microdata. Our approach has been evaluated and compared with related works when protecting real datasets with textual attribute values. Results show that our method produces masked datasets that better minimise the information loss resulting from the data transformation.
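To make the clustering-and-replacement idea concrete, the following is a minimal sketch of MDAV as it is commonly described for numerical data. It is not the modified categorical algorithm proposed in this paper, whose details are not given here. The function names (`centroid`, `take_cluster`, `mdav`, `microaggregate`), the use of Euclidean distance, and the use of the mean as cluster prototype are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def take_cluster(records, remaining, seed, k):
    """Remove and return the k remaining indices nearest to records[seed]."""
    nearest = sorted(remaining, key=lambda i: dist(records[i], records[seed]))[:k]
    for i in nearest:
        remaining.remove(i)
    return nearest


def mdav(records, k):
    """Partition record indices into clusters (size >= k when len(records) >= k)."""
    remaining = list(range(len(records)))
    clusters = []
    # Main loop: build two clusters of exactly k records around two extreme records.
    while len(remaining) >= 3 * k:
        c = centroid([records[i] for i in remaining])
        r = max(remaining, key=lambda i: dist(records[i], c))           # farthest from centroid
        s = max(remaining, key=lambda i: dist(records[i], records[r]))  # farthest from r
        clusters.append(take_cluster(records, remaining, r, k))
        clusters.append(take_cluster(records, remaining, s, k))
    # Between 2k and 3k-1 records left: one more cluster of size k around an extreme record.
    if len(remaining) >= 2 * k:
        c = centroid([records[i] for i in remaining])
        r = max(remaining, key=lambda i: dist(records[i], c))
        clusters.append(take_cluster(records, remaining, r, k))
    # Whatever is left (fewer than 2k records) forms the last cluster.
    if remaining:
        clusters.append(remaining)
    return clusters


def microaggregate(records, k):
    """Replace every record by the centroid (prototype) of its cluster."""
    masked = list(records)
    for cluster in mdav(records, k):
        prototype = centroid([records[i] for i in cluster])
        for i in cluster:
            masked[i] = prototype
    return masked


if __name__ == "__main__":
    data = [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0), (8.0, 8.0), (1.0, 0.6),
            (9.0, 11.0), (8.5, 9.0), (0.5, 1.0), (4.0, 5.0), (6.0, 7.0)]
    for original, protected in zip(data, microaggregate(data, k=3)):
        print(original, "->", protected)
```

Both the distance and the prototype in this sketch rely on arithmetic over the attribute values. Neither operation is directly available for categorical attributes, which is the difficulty the abstract refers to.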
