Collaborative Search Log Sanitization: Toward Differential Privacy and Boosted Utility

The severe privacy leakage in the AOL search log incident attracted considerable worldwide attention. However, such data record web users' daily search intents and behavior, which can be invaluable to researchers, data analysts and law enforcement personnel for social behavior studies [14], criminal investigation [5] and epidemic detection [10]. Thus, an important and challenging research problem is how to sanitize search logs with a strong privacy guarantee while retaining sufficient utility. Existing approaches to search log sanitization can either protect privacy under a rigorous standard [24] or maintain good output utility [25], but not both. To the best of our knowledge, little work has resolved this tradeoff in the context of search logs while meeting a high standard on both requirements. In this paper, we propose a sanitization framework that tackles the issue in a distributed manner. More specifically, our framework enables different parties to collaboratively generate search logs with boosted utility while satisfying differential privacy. In this scenario, two privacy-preserving objectives arise: first, the collaborative sanitization should satisfy differential privacy; second, the collaborating parties should not learn any private information from each other. We present an efficient protocol, Collaborative sEarch Log Sanitization (CELS), to meet both privacy requirements. Besides security/privacy and cost analysis, we demonstrate the utility and efficiency of our approach on real data sets.
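
For orientation, the first objective can be read against the standard definition of ε-differential privacy due to Dwork. The statement below is a sketch of that textbook notion only; the abstract does not say which variant or parameter values the framework adopts, and the symbols used here (mechanism $\mathcal{A}$, neighboring search logs $D$ and $D'$, output set $S$) are notation assumed for illustration.

```latex
% A randomized sanitization mechanism \mathcal{A} is \epsilon-differentially private
% if, for every pair of neighboring search logs D and D' (assumed here to differ in
% the data of a single user) and every set S of possible outputs,
\[
  \Pr\bigl[\mathcal{A}(D) \in S\bigr] \;\le\; e^{\epsilon} \cdot \Pr\bigl[\mathcal{A}(D') \in S\bigr].
\]
```

A smaller $\epsilon$ forces the two output distributions closer together, which strengthens the privacy guarantee and generally costs utility; that is the tradeoff the abstract refers to.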

[1]  Stephen C. Pohlig, et al. An Improved Algorithm for Computing Logarithms over GF(p) and Its Cryptographic Significance, 1978, IEEE Trans. Inf. Theory.

[2]  Martin E. Hellman, et al. An improved algorithm for computing logarithms over GF(p) and its cryptographic significance (Corresp.), 1978, IEEE Trans. Inf. Theory.

[3]  Kenneth Steiglitz, et al. Combinatorial Optimization: Algorithms and Complexity, 1981.

[4]  Matthew K. Franklin, et al. Fair exchange with a semi-trusted third party (extended abstract), 1997, CCS '97.

[5]  Jacques Stern, et al. A new public key cryptosystem based on higher residues, 1998, CCS '98.

[6]  Pascal Paillier, et al. Public-Key Cryptosystems Based on Composite Degree Residuosity Classes, 1999, EUROCRYPT.

[7]  Latanya Sweeney, et al. k-Anonymity: A Model for Protecting Privacy, 2002, Int. J. Uncertain. Fuzziness Knowl. Based Syst.

[8]  Thorsten Joachims, et al. Optimizing search engines using clickthrough data, 2002, KDD.

[9]  Oded Goldreich. Foundations of Cryptography: Encryption Schemes, 2004.

[10]  Sheng Zhong, et al. Privacy-enhancing k-anonymization of customer data, 2005, PODS.

[11]  Farooq Ahmad, et al. Learning a Spelling Error Model from Search Query Logs, 2005, HLT.

[12]  Massimo Barbaro, et al. A Face Is Exposed for AOL Searcher No, 2006.

[13]  Cynthia Dwork, et al. Differential Privacy, 2006, ICALP.

[14]  Chris Clifton, et al. A secure distributed framework for achieving k-anonymity, 2006, The VLDB Journal.

[15]  Cynthia Dwork, et al. Calibrating Noise to Sensitivity in Private Data Analysis, 2006, TCC.

[16]  Georges Dupret, et al. Automatic Query Recommendation using Click-Through Data, 2006, IFIP PPAI.

[17]  Sofya Raskhodnikova, et al. Smooth sensitivity and sampling in private data analysis, 2007, STOC '07.

[18]  Hamdy A. Taha, et al. Operations research: an introduction, 1982.

[19]  Eytan Adar, et al. User 4XXXXX9: Anonymizing Query Logs, 2007.

[20]  Ravi Kumar, et al. "I know what you did last summer": query logs and user privacy, 2007, CIKM '07.

[21]  Ravi Kumar, et al. On anonymizing query logs via token-based hashing, 2007, WWW '07.

[22]  Alissa Cooper, et al. A survey of query log privacy-enhancing techniques from a policy perspective, 2008, TWEB.

[23]  Nina Mishra, et al. Releasing search queries and clicks privately, 2009, WWW '09.

[24]  Jeremy Ginsberg, et al. Detecting influenza epidemics using search engine query data, 2009, Nature.

[25]  Hongbo Deng, et al. Entropy-biased models for query representation on the click graph, 2009, SIGIR.

[26]  David D. Jensen, et al. Accurate Estimation of the Degree Distribution of Private Networks, 2009, 2009 Ninth IEEE International Conference on Data Mining.

[27]  Ilya Mironov, et al. Differentially private recommender systems: building privacy into the net, 2009, KDD.

[28]  Jeffrey F. Naughton, et al. Anonymization of Set-Valued Data via Top-Down, Local Generalization, 2009, Proc. VLDB Endow.

[29]  Vijayalakshmi Atluri, et al. Effective anonymization of query logs, 2009, CIKM.

[30]  Benjamin C. M. Fung, et al. Anonymity meets game theory: secure data integration with malicious participants, 2011, The VLDB Journal.

[31]  Johannes Gehrke, et al. Differential privacy via wavelet transforms, 2009, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[32]  Benjamin C. M. Fung, et al. Centralized and Distributed Anonymization for High-Dimensional Healthcare Data, 2010, TKDD.

[33]  Yvo Desmedt, et al. Encryption Schemes, 1999, Algorithms and Theory of Computation Handbook.

[34]  Ke Wang, et al. Enforcing Vocabulary k-Anonymity by Semantic Similarity Based Clustering, 2010, 2010 IEEE International Conference on Data Mining.

[35]  Basit Shafiq, et al. Privacy-Preserving Tabu Search for Distributed Graph Coloring, 2011, 2011 IEEE Third Int'l Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third Int'l Conference on Social Computing.

[36]  Jaideep Vaidya, et al. Efficient Distributed Linear Programming with Limited Disclosure, 2011, DBSec.

[37]  James Allan, et al. CrowdLogging: distributed, private, and anonymous search logging, 2011, SIGIR '11.

[38]  Benjamin C. M. Fung, et al. m-Privacy for collaborative data publishing, 2011, 7th International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom).

[39]  Ashwin Machanavajjhala, et al. Publishing Search Logs—A Comparative Study of Privacy Guarantees, 2012, IEEE Transactions on Knowledge and Data Engineering.

[40]  Aaron Roth, et al. Beating randomized response on incoherent matrices, 2011, STOC '12.

[41]  W. Marsden. I and J, 2012.

[42]  Jaideep Vaidya, et al. Secure and efficient distributed linear programming, 2012, J. Comput. Secur.

[43]  Benjamin C. M. Fung, et al. Secure Distributed Framework for Achieving ε-Differential Privacy, 2012, Privacy Enhancing Technologies.

[44]  Olvi L. Mangasarian. Privacy-preserving horizontally partitioned linear programs, 2012, Optim. Lett.

[45]  Jaideep Vaidya, et al. Differentially private search log sanitization with optimal output utility, 2011, EDBT '12.

[46]  Ninghui Li, et al. On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy, 2011, ASIACCS '12.

[47]  Nina Mishra, et al. Privacy via the Johnson-Lindenstrauss Transform, 2012, J. Priv. Confidentiality.

[48]  Privacy-preserving collaborative optimization, 2013.

[49]  Wei Li, et al. Privacy-preserving horizontally partitioned linear programs with inequality constraints, 2013, Optim. Lett.

[50]  S. Rajsbaum. Foundations of Cryptography, 2014.

[51]  Jaideep Vaidya, et al. A Survey of Privacy-Aware Supply Chain Collaboration: From Theory to Applications, 2014, J. Inf. Syst.

[52]  Jaideep Vaidya, et al. Collaboratively Solving the Traveling Salesman Problem with Limited Disclosure, 2014, DBSec.

[53]  Tom Minka, et al. A* Sampling, 2014, NIPS.

[54]  Cynthia Dwork, et al. Calibrating Noise to Sensitivity in Private Data Analysis, 2016, J. Priv. Confidentiality.