"I know what you did last summer": query logs and user privacy

We investigate the subtle cues to user identity that may be exploited in attacks on the privacy of users in web search query logs. We study the application of simple classifiers to map a sequence of queries to the gender, age, and location of the user issuing them. We then show how these classifiers may be carefully combined at multiple granularities to map a sequence of queries to a set of candidate users that is 300-600 times smaller than random chance would allow. We show that this approach remains accurate even after removing personally identifiable information such as names and numbers, or after limiting the size of the query log. We also present a new attack in which a real-world acquaintance of a user attempts to identify that user in a large query log using personal information about that user. We show that combinations of small pieces of information about terms a user would probably search for can be highly effective in identifying that user's sessions. We conclude that known schemes for releasing even heavily scrubbed query logs that contain session information carry significant privacy risks.
