Dominos: A New Web Crawler's Design

Today’s search engines rely on specialized agents known as Web crawlers (download robots) to harvest large volumes of Web content online. This content is then analyzed, indexed and made available to users. Crawlers interact with thousands of Web servers over periods extending from a few weeks to several years, so the crawling process must satisfy judicious criteria such as robustness, flexibility and maintainability. In this paper, we describe the design and implementation of a real-time distributed Web crawling system running on a cluster of machines. The system crawls several thousand pages every second, includes a high-performance fault manager, is platform independent and can adapt transparently to a wide range of configurations without incurring additional hardware expenditure. We then detail the system architecture and the technical choices made for very high-performance crawling. Finally, we discuss the experimental results obtained and compare them with other documented systems.
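As a rough illustration only (not the Dominos implementation itself), the sketch below shows the general per-node pattern such a distributed crawler follows: a shared URL frontier, a pool of fetcher threads, and a simple fault handler that retries failed downloads a bounded number of times. All names, parameters and the retry policy here are assumptions made for the example.

```python
# Minimal sketch of a multi-worker crawl loop (illustrative, not Dominos):
# a shared URL frontier, several fetcher threads, and bounded retries
# standing in for the paper's fault manager.

import queue
import threading
import urllib.error
import urllib.request

NUM_WORKERS = 8   # assumed degree of parallelism per machine
MAX_RETRIES = 2   # assumed retry budget for failed fetches

frontier = queue.Queue()        # (url, attempts) pairs waiting to be fetched
results = {}                    # url -> page size (placeholder for analysis/indexing)
results_lock = threading.Lock()

def fetch_worker():
    while True:
        url, attempts = frontier.get()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read()
            with results_lock:
                results[url] = len(body)   # downstream analysis/indexing would go here
        except (urllib.error.URLError, OSError):
            # crude fault handling: requeue until the retry budget is exhausted
            if attempts < MAX_RETRIES:
                frontier.put((url, attempts + 1))
        finally:
            frontier.task_done()

def crawl(seed_urls):
    for url in seed_urls:
        frontier.put((url, 0))
    for _ in range(NUM_WORKERS):
        threading.Thread(target=fetch_worker, daemon=True).start()
    frontier.join()              # wait until the frontier is drained
    return results

if __name__ == "__main__":
    print(crawl(["https://example.com/"]))
```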
