Utility preserving query log anonymization via semantic microaggregation

Abstract: Query logs are of great interest to scientists and companies for research, statistical, and commercial purposes. However, releasing query logs for secondary use raises privacy issues, because they can allow individual users to be identified and sensitive information about them to be revealed. Hence, query log anonymization is crucial to avoid identity disclosure. To enable the publication of privacy-preserving but still useful query logs, this paper presents an anonymization method based on semantic microaggregation. Our proposal aims to minimize the disclosure risk of anonymized query logs while retaining their semantics as much as possible. First, we present a method that maps queries to their formal semantics, extracted from the structured categories of the Open Directory Project. Then, a microaggregation method is adapted to perform a semantically grounded anonymization of query logs. To do so, appropriate semantic similarity and semantic aggregation functions are proposed. Experiments on real AOL query logs show that our proposal retains the utility of anonymized query logs better than related works, while also minimizing the disclosure risk.
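The abstract does not specify the similarity or aggregation functions, so the following is only a toy sketch of the general idea under stated assumptions, not the method proposed in the paper. It assumes each user has already been mapped to a single slash-separated category path (in the style of Open Directory Project categories), uses a Wu-Palmer-style path similarity as a stand-in similarity function, and uses the group medoid as a stand-in aggregation function. The names `path_similarity`, `medoid`, and `microaggregate` are hypothetical.

```python
from typing import Dict, List


def path_similarity(a: str, b: str) -> float:
    """Wu-Palmer-style similarity of two '/'-separated category paths.

    Assumption: the taxonomy is a tree and a path's depth is its length.
    """
    pa, pb = a.split("/"), b.split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1  # depth of the deepest shared ancestor
    return 2.0 * common / (len(pa) + len(pb))


def medoid(paths: List[str]) -> str:
    """Stand-in aggregation: the path most similar to the rest of the group."""
    return max(paths, key=lambda p: sum(path_similarity(p, q) for q in paths))


def microaggregate(records: Dict[str, str], k: int) -> Dict[str, str]:
    """Greedy grouping into clusters of at least k users.

    Every user's category is replaced by the centroid of its cluster,
    so each published value is shared by at least k users.
    """
    if k < 1 or len(records) < k:
        raise ValueError("need k >= 1 and at least k records")
    remaining = sorted(records)  # deterministic order of user ids
    groups: List[List[str]] = []
    while len(remaining) >= 2 * k:
        seed = remaining[0]
        # Rank the other users by similarity to the seed, most similar first.
        rest = sorted(
            remaining[1:],
            key=lambda u: -path_similarity(records[seed], records[u]),
        )
        group = [seed] + rest[: k - 1]
        groups.append(group)
        remaining = [u for u in remaining if u not in group]
    groups.append(remaining)  # final group holds between k and 2k-1 users

    anonymized: Dict[str, str] = {}
    for group in groups:
        centroid = medoid([records[u] for u in group])
        for user in group:
            anonymized[user] = centroid
    return anonymized


# Toy input: user ids and category paths are made up for illustration.
toy = {
    "u1": "Sports/Soccer/Clubs",
    "u2": "Sports/Soccer/Players",
    "u3": "Health/Nutrition",
    "u4": "Health/Fitness/Yoga",
}
result = microaggregate(toy, k=2)
```

With `k=2`, the two sports users end up sharing one category and the two health users another, so no published value is unique to a single user. A medoid always returns one of the original paths; an aggregation that generalizes to a shared ancestor category would be a different design choice.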
