The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing

Large-scale Web crawls have emerged as the state of the art for studying characteristics of the Web. In particular, they are a core tool for online tracking research. Web crawling is an attractive approach to data collection, as crawls can be run at relatively low infrastructure cost and do not require handling sensitive user data such as browsing histories. However, the biases introduced by using crawls as a proxy for human browsing data have not been well studied. Crawls may fail to capture the diversity of user environments, and the snapshot view of the Web presented by one-time crawls does not reflect its constantly evolving nature, which hinders reproducibility of crawl-based studies. In this paper, we quantify the repeatability and representativeness of Web crawls in terms of common tracking and fingerprinting metrics, considering both variation across crawls and divergence from human browser usage. We first quantify the baseline variation of simultaneous crawls, then isolate the effects of time, cloud versus residential IP address, and operating system. This provides a foundation to assess the agreement between crawls visiting a standard list of high-traffic websites and actual browsing behaviour measured from an opt-in sample of over 50,000 users of the Firefox Web browser. Our analysis reveals differences between the treatment of stateless crawling infrastructure and generally stateful human browsing, showing, for example, that crawlers tend to experience higher rates of third-party activity than human browser users when loading pages from the same domains.
