Evaluation of crawling policies for a web-repository crawler

We have developed a web-repository crawler for reconstructing websites when backups are unavailable. Our crawler retrieves web resources from the Internet Archive, Google, Yahoo, and MSN. We examine the challenges of crawling web repositories and discuss strategies for overcoming some of these obstacles. We propose three crawling policies that can be used to reconstruct websites, and we evaluate their effectiveness by reconstructing 24 websites and comparing the results with the live versions of those websites. We conclude with our experiences reconstructing lost websites on behalf of others and discuss plans for improving our web-repository crawler.
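To illustrate the general shape of a web-repository crawling policy, the sketch below implements a simple "first-hit" strategy: for each missing URL, query the repositories in a fixed preference order and keep the first copy found. The repository names, the dictionary-backed lookups, and the function names are illustrative stand-ins, not the paper's actual policies or implementation.

```python
def reconstruct(urls, repositories):
    """Map each URL to (repository_name, content) from the first
    repository holding a copy, or to None if no repository has it."""
    recovered = {}
    for url in urls:
        recovered[url] = None
        for name, lookup in repositories:
            content = lookup(url)  # returns the stored resource or None
            if content is not None:
                recovered[url] = (name, content)
                break  # first-hit policy: stop at the first repository with a copy
    return recovered

# Stub repositories standing in for the Internet Archive and search-engine caches.
ia_cache = {"http://example.com/": "<html>archived</html>"}
google_cache = {"http://example.com/a": "<html>cached</html>"}

repos = [
    ("InternetArchive", ia_cache.get),
    ("Google", google_cache.get),
]

result = reconstruct(
    ["http://example.com/", "http://example.com/a", "http://example.com/missing"],
    repos,
)
```

A real crawler would replace the dictionary lookups with repository-specific API or cache queries, and alternative policies could differ in how they order repositories or whether they collect multiple copies per URL.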
