Rewriting History: Changing the Archived Web from the Present

The Internet Archive's Wayback Machine is the largest modern web archive, preserving web content since 1996. We discover and analyze several vulnerabilities in how the Wayback Machine archives data, and then leverage these vulnerabilities to create what are, to our knowledge, the first attacks against a user's view of the archived web. These vulnerabilities arise from the unique interaction between the Wayback Machine's archives, other websites, and a user's browser; an attacker does not need to compromise the archives themselves in order to compromise a user's view of a stored page. We demonstrate the effectiveness of our attacks through proof-of-concept implementations. We then conduct a measurement study to quantify the prevalence of these vulnerabilities in the archive. Finally, we explore defenses that could be deployed by archives, website publishers, and archive users, and present ArchiveWatcher, a prototype defense for clients of the Wayback Machine.
