Lazy preservation: reconstructing websites by crawling the crawlers

Backup of websites is often not considered until after a catastrophic event has occurred to either the website or its webmaster. We introduce "lazy preservation": digital preservation performed as a side effect of the normal operation of web crawlers and caches. Lazy preservation is especially suitable for third parties; for example, a teacher reconstructing a missing website used in previous classes. We evaluate the effectiveness of lazy preservation by reconstructing 24 websites of varying sizes and composition using Warrick, a web-repository crawler. Because no single repository holds a complete copy of any given site, our reconstructions sampled from four different web repositories: Google (44%), MSN (30%), Internet Archive (19%), and Yahoo (7%). We also measured the time required for web resources to be discovered and cached (10-103 days) as well as how long they remained in cache after deletion (7-61 days).
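
As an illustration of the mechanism, the sketch below recovers the list of pages that a web repository holds for a lost site, which is the first step of any reconstruction. It is a minimal example under stated assumptions: it queries only the Internet Archive's public Wayback CDX API, whereas Warrick also mines the caches of Google, MSN, and Yahoo; the function name and the example domain are illustrative, not part of Warrick.

    import json
    import urllib.parse
    import urllib.request

    def list_archived_snapshots(domain, limit=50):
        """Ask the Wayback CDX API which URLs under `domain` it has captured."""
        query = urllib.parse.urlencode({
            "url": domain + "/*",        # every page under the domain
            "output": "json",
            "fl": "original,timestamp",  # fields needed to rebuild snapshot URLs
            "collapse": "urlkey",        # one row per distinct URL
            "limit": str(limit),
        })
        with urllib.request.urlopen("https://web.archive.org/cdx/search/cdx?" + query) as resp:
            rows = json.load(resp)
        # The first row is a header; the rest are [original, timestamp] pairs.
        return [f"https://web.archive.org/web/{ts}/{orig}" for orig, ts in rows[1:]]

    # A reconstruction pass would then fetch each snapshot URL and save the
    # response body, choosing among repositories when several hold a copy.
    for snapshot in list_archived_snapshots("example.com", limit=10):
        print(snapshot)

Warrick's actual crawling policies, including how it chooses among competing cached copies, are evaluated in the paper; the sketch above covers only URL discovery against one repository.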
