Archival HTTP redirection retrieval policies

When retrieving archived copies of web resources (mementos) from web archives, the URI of the original resource (URI-R) is typically used as the lookup key. This is straightforward until the resource on the live web issues a redirect, R -> R'. It is then unclear whether R or R' should be used as the lookup key in the web archive. In this paper, we report on a quantitative study that evaluates a set of policies to help the client discover the correct memento when faced with redirection. We studied the stability of 10,000 resources and found that 48% of the sampled URIs were not stable with respect to their status and redirection location. In terms of reliability, defined as the number of mementos with successful responses divided by the total number of mementos, 27% of the resources were not perfectly reliable, and 2% had a reliability score below 0.5. We tested two retrieval policies. The first policy covered resources that currently issue redirects; it successfully resolved 17 of the 77 URIs that had no mementos of the original URI but did have mementos of the redirection target. The second policy covered archived copies with HTTP redirection and, in 58% of the cases tested, helped the client discover the memento nearest to the requested datetime.
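
The reliability score and the first retrieval policy described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the in-memory `archive` dictionary standing in for a web archive, and the choice to count only 2xx status codes as "successful" are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def reliability(statuses: List[int]) -> float:
    """Share of a resource's mementos that hold a successful response.

    Assumption: "successful" is taken to mean an archived 2xx status code.
    """
    if not statuses:
        raise ValueError("resource has no mementos")
    successful = sum(1 for status in statuses if 200 <= status < 300)
    return successful / len(statuses)


def choose_lookup_key(
    uri_r: str,
    live_redirect_target: Optional[str],
    archive: Dict[str, List[int]],
) -> Optional[str]:
    """Hypothetical version of the first policy: prefer R, fall back to R'.

    `archive` maps a URI to the status codes of its mementos; it is a
    stand-in for a real web archive lookup.
    """
    if archive.get(uri_r):
        return uri_r  # mementos exist for the original URI
    if live_redirect_target and archive.get(live_redirect_target):
        return live_redirect_target  # only the redirect target is archived
    return None  # neither R nor R' has mementos


# Toy data: R was never archived, but its live redirect target R' was.
archive = {"http://example.org/new": [200, 200, 301, 404]}
key = choose_lookup_key("http://example.org/old", "http://example.org/new", archive)
score = reliability(archive[key])
```

In this toy case the lookup falls back to the redirect target, and two of its four mementos are successful, so the score is 0.5.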
