Anonymizing bag-valued sparse data by semantic similarity-based clustering

Web query logs are a rich source of information, but publishing them poses serious privacy risks. We preserve privacy when publishing vocabularies extracted from a web query log by introducing vocabulary k-anonymity, which prevents re-identification attacks that reveal the real identity associated with a vocabulary. A vocabulary is a bag of query terms extracted from the queries issued by a user at a specified granularity. Such bag-valued data are extremely sparse, which makes it hard to retain enough utility while enforcing k-anonymity. To the best of our knowledge, prior work does not solve this problem: some approaches achieve a different privacy principle (for example, differential privacy), some deal with a different type of data (for example, set-valued or relational data), and some consider a different publication scenario (for example, publishing frequent keywords). To retain enough data utility, we propose a semantic similarity-based clustering approach. It measures the semantic similarity between a pair of terms by the minimum path distance over a semantic network of terms such as WordNet, computes the semantic similarity between two vocabularies by weighted bipartite matching, and publishes a typical vocabulary for each cluster of semantically similar vocabularies. Extensive experiments on the AOL query log show that our approach retains enough data utility in terms of loss metrics and in frequent pattern mining.
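The two similarity steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy network `EDGES` stands in for WordNet, the names `build_graph`, `path_distance`, and `vocabulary_distance` are hypothetical, and the `unmatched_cost` penalty for terms left without a partner is an assumption, since the abstract does not say how the matching is weighted. The matching is found by brute force, which is only practical for very small bags.

```python
from collections import deque
from itertools import permutations

# Hypothetical toy semantic network (undirected edges), standing in for WordNet.
EDGES = [
    ("entity", "animal"), ("entity", "vehicle"),
    ("animal", "dog"), ("animal", "cat"),
    ("dog", "puppy"), ("cat", "kitten"),
    ("vehicle", "car"),
]


def build_graph(edges):
    """Build an adjacency map from a list of undirected edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def path_distance(graph, source, target):
    """Minimum number of edges between two terms (BFS); None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def vocabulary_distance(graph, bag_a, bag_b, unmatched_cost=10):
    """Cost of a minimum-cost matching between two bags of terms.

    Assumption: each term left unmatched (or unreachable in the network)
    contributes `unmatched_cost`. Brute force over all assignments.
    """
    small, large = sorted((list(bag_a), list(bag_b)), key=len)

    def cost(s, t):
        d = path_distance(graph, s, t)
        return unmatched_cost if d is None else d

    best = None
    for perm in permutations(large, len(small)):
        total = sum(cost(s, t) for s, t in zip(small, perm))
        if best is None or total < best:
            best = total
    return best + unmatched_cost * (len(large) - len(small))


graph = build_graph(EDGES)
d_terms = path_distance(graph, "puppy", "kitten")
d_bags = vocabulary_distance(graph, ["dog", "car"], ["puppy", "vehicle"])
```

Here a smaller value means the two vocabularies are semantically closer. For bags of realistic size, the brute-force loop would be replaced by a polynomial-time assignment algorithm such as the Hungarian method.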
