Lazy preservation: reconstructing websites by crawling the crawlers

Backup of websites is often not considered until after a catastrophic event has occurred to either the website or its webmaster. We introduce "lazy preservation": digital preservation performed as a side effect of the normal operation of web crawlers and caches. Lazy preservation is especially suitable for third parties; for example, a teacher reconstructing a missing website used in previous classes. We evaluate the effectiveness of lazy preservation by reconstructing 24 websites of varying sizes and composition using Warrick, a web-repository crawler. Because no single repository holds a complete copy of any given site, our reconstructions sampled from four different web repositories: Google (44%), MSN (30%), Internet Archive (19%), and Yahoo (7%). We also measured the time required for web resources to be discovered and cached (10-103 days) as well as how long they remained in cache after deletion (7-61 days).
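
As an illustration of the mechanism, the sketch below recovers the list of pages that a web repository holds for a lost site, which is the first step of any reconstruction. It is a minimal example under stated assumptions: it queries only the Internet Archive's public Wayback CDX API, whereas Warrick also mines the caches of Google, MSN, and Yahoo; the function name and the example domain are illustrative, not part of Warrick.

    import json
    import urllib.parse
    import urllib.request

    def list_archived_snapshots(domain, limit=50):
        """Ask the Wayback CDX API which URLs under `domain` it has captured."""
        query = urllib.parse.urlencode({
            "url": domain + "/*",        # every page under the domain
            "output": "json",
            "fl": "original,timestamp",  # fields needed to rebuild snapshot URLs
            "collapse": "urlkey",        # one row per distinct URL
            "limit": str(limit),
        })
        with urllib.request.urlopen("https://web.archive.org/cdx/search/cdx?" + query) as resp:
            rows = json.load(resp)
        # The first row is a header; the rest are [original, timestamp] pairs.
        return [f"https://web.archive.org/web/{ts}/{orig}" for orig, ts in rows[1:]]

    # A reconstruction pass would then fetch each snapshot URL and save the
    # response body, choosing among repositories when several hold a copy.
    for snapshot in list_archived_snapshots("example.com", limit=10):
        print(snapshot)

Warrick's actual crawling policies, including how it chooses among competing cached copies, are evaluated in the paper; the sketch above covers only URL discovery against one repository.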
