What are you Googling? - Inferring search type information through a statistical classifier

Privacy in communications calls primarily for encryption of information flows. Privacy breaches of packet traffic flows have been widely demonstrated in point-to-point communications, owing to information leakage from observable traffic features such as packet length, timestamp, and direction. We address a point-to-multipoint system, namely a Content Delivery Network, where user clients maintain and use connections with a number of servers. Specifically, we address Google search services, which are conveyed over TLS connections using HTTPS, either from within a user account or without logging in as a Google services user. HTTPS is provided to protect communications privacy. Yet we show that, by collecting the encrypted traffic and extracting simple features related to traffic activity and possibly the amount of data sent by servers to clients, effective classifiers of user activity can be realized. Specifically, we are able to distinguish which type of search a user is carrying out among a given set of alternatives (text, images, maps, video, video on YouTube, news), with average success rates that can exceed 90%.
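
As a rough illustration of the idea, the sketch below derives a few traffic-activity features from the observable metadata of an encrypted session (timestamp, length, direction) and assigns a search type with a nearest-centroid rule. Everything in it is an assumption made for illustration: the `Packet` record, the 0.5 s idle gap, the three features, the per-feature scale, and the nearest-centroid rule itself, which stands in for whatever statistical classifier is actually used and is not the authors' method.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Packet:
    """Observable metadata of one encrypted packet (hypothetical record)."""
    timestamp: float  # seconds since the start of the capture
    length: int       # bytes on the wire; the payload itself stays encrypted
    downstream: bool  # True if sent from server to client


def extract_features(packets, idle_gap=0.5):
    """Return (active_time, n_bursts, downstream_bytes) for one session.

    A new burst starts whenever two consecutive packets are more than
    idle_gap seconds apart (the threshold is an illustrative assumption).
    """
    if not packets:
        return (0.0, 0, 0)
    pkts = sorted(packets, key=lambda p: p.timestamp)
    bursts = 1
    active = 0.0
    for prev, cur in zip(pkts, pkts[1:]):
        gap = cur.timestamp - prev.timestamp
        if gap > idle_gap:
            bursts += 1       # idle period: count a new burst
        else:
            active += gap     # packets close together: accumulate activity
    down = sum(p.length for p in pkts if p.downstream)
    return (active, bursts, down)


def train_centroids(labeled):
    """Average the feature vectors of each label.

    labeled is a list of (feature_tuple, label) pairs.
    """
    by_label = {}
    for feats, label in labeled:
        by_label.setdefault(label, []).append(feats)
    return {
        label: tuple(mean(col) for col in zip(*rows))
        for label, rows in by_label.items()
    }


def classify(feats, centroids, scale):
    """Return the label whose centroid is closest in scaled squared distance."""
    def dist(centroid):
        return sum(((f - c) / s) ** 2 for f, c, s in zip(feats, centroid, scale))
    return min(centroids, key=lambda label: dist(centroids[label]))
```

In use, one would call `extract_features` on each labeled capture, pass the results to `train_centroids`, and then call `classify` on the features of an unlabeled session. The `scale` argument keeps the byte count from dominating the distance; its values would need tuning on real data.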
