Tranco: A Research-Oriented Top Sites Ranking Hardened Against Manipulation

To evaluate the prevalence of security and privacy practices on a representative sample of the Web, researchers rely on website popularity rankings such as the Alexa list. The validity and representativeness of these rankings are rarely questioned, yet our findings call both into doubt: we show for four main rankings how their inherent properties (similarity, stability, representativeness, responsiveness and benignness) affect their composition and can therefore skew the conclusions of studies. Moreover, we find that it is trivial for an adversary to manipulate the composition of these lists. We are the first to empirically validate that the ranks of domains in each of the lists are easily altered, in the case of Alexa through as little as a single HTTP request. This allows adversaries to manipulate rankings on a large scale and insert malicious domains into whitelists or bend the outcome of research studies to their will. To overcome the limitations of such rankings, we propose improvements that reduce the fluctuations in list composition and provide better defenses against manipulation. To allow the research community to work with reliable and reproducible rankings, we provide Tranco, an improved ranking that we offer through an online service.
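
The abstract does not spell out how fluctuations in list composition are reduced. As a rough illustration only, the sketch below assumes one plausible approach: merging several input rankings (for example, different providers or different days) with a Borda count, so that a domain must rank well across many lists to rank well overall. The function name borda_aggregate, the list_length parameter, and the sample domains are hypothetical and are not taken from the abstract.

```python
from collections import defaultdict


def borda_aggregate(rankings, list_length=None):
    """Merge several ranked domain lists into one using a Borda count.

    Assumption for illustration: each input list is ordered from most to
    least popular, and a domain at 0-based position p earns
    (list_length - p) points. Domains missing from a list earn nothing.
    """
    if list_length is None:
        list_length = max(len(r) for r in rankings)
    scores = defaultdict(int)
    for ranking in rankings:
        for position, domain in enumerate(ranking):
            scores[domain] += list_length - position
    # Highest total score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical input: three short rankings of placeholder domains.
daily_lists = [
    ["example.com", "example.org", "example.net"],
    ["example.org", "example.com", "example.edu"],
    ["example.com", "example.net", "example.org"],
]
combined = borda_aggregate(daily_lists)
print(combined)
```

Under this scheme a domain that appears in only one input list, such as example.edu here, collects few points, which hints at why aggregation can dampen both day-to-day churn and single-list manipulation.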
