ReSurf: Reconstructing web-surfing activity from network traffic

As more applications and services move to the web, web traffic has grown to as much as 80% of all network traffic. At the same time, most traffic classification efforts stop once they correctly label a flow as web or HTTP. In this paper, we focus on understanding what happens “under the hood” of HTTP traffic. Our first contribution is ReSurf, a systematic approach to reconstructing web-surfing activity from raw network data, which achieves more than 91% recall and 95% precision over four real network traces. Our second contribution is an extensive analysis of web activity across these traces. Using ReSurf, we study web-surfing behavior in terms of user requests and transitions between websites (e.g., the click-through history of following hyperlinks). A surprising result is the prevalence of advertising and tracking services that are accessed during web surfing without the user's explicit consent. In our traces, we found that a user accesses such a service with 90% probability after just three user requests (or “clicks”). We believe that our methodology and findings provide valuable insights into modern traffic that can allow: (a) network administrators to better manage and protect their networks, (b) traffic regulators to protect the rights of online users, and (c) researchers to better understand the evolution of traffic from modern websites.
