User 4XXXXX9: Anonymizing Query Logs

The recent release of the America Online (AOL) query logs highlighted the remarkable amount of private and identifying information that users are willing to reveal to a search engine. Releasing log files of this kind therefore represents a significant liability and a compromise of user privacy. Without such data, however, the academic community is severely limited in its ability to conduct research on real search engines. This paper proposes two specific solutions (rather than an overly general framework) that attempt to balance the needs of certain types of research against individual privacy. The first solution, based on a threshold cryptography system, eliminates highly identifying queries in real time, without preserving history or statistics about previous behavior. The second solution addresses sets of queries that, when taken in aggregate, are overly identifying. Both are novel and represent additional options for data anonymization.
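To make the threshold idea concrete, the following is a minimal sketch of k-of-n secret sharing in the style of Shamir's scheme, the standard building block of threshold cryptography. It is not the paper's construction: the abstract does not say how shares are derived or assigned, so the function names (`split_secret`, `recover_secret`), the field modulus, and the idea that each share corresponds to one distinct user issuing the same query are all assumptions made for illustration. The point it shows is that a value protected this way stays unreadable until at least k shares are available.

```python
import secrets

# Field modulus: a Mersenne prime (assumption; any prime larger than the secret works).
PRIME = 2**127 - 1


def split_secret(secret: int, k: int, n: int) -> list[tuple[int, int]]:
    """Split `secret` into n shares such that any k of them recover it."""
    if not (0 <= secret < PRIME) or not (1 <= k <= n):
        raise ValueError("invalid secret or threshold")
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):  # x = 0 would reveal the secret, so start at 1
        y = 0
        for c in reversed(coeffs):  # Horner evaluation mod PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares: list[tuple[int, int]]) -> int:
    """Lagrange-interpolate the polynomial at x = 0 from k distinct shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


if __name__ == "__main__":
    key = 123456789  # hypothetical key protecting one logged query
    shares = split_secret(key, k=3, n=5)
    print(recover_secret(shares[:3]) == key)
```

Under these assumptions, a query issued by fewer than k distinct users never accumulates enough shares to be decoded, which is one way a rare, highly identifying query could be suppressed without keeping per-query statistics.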
