Design and implementation of a high-performance distributed Web crawler

Broad Web search engines as well as many more specialized search tools rely on Web crawlers to acquire large collections of pages for indexing and analysis. Such a Web crawler may interact with millions of hosts over a period of weeks or months, and thus issues of robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and OS limits must be taken into account in order to achieve high performance at a reasonable cost. In this paper, we describe the design and implementation of a distributed Web crawler that runs on a network of workstations. The crawler scales to (at least) several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the software architecture of the system, discuss the performance bottlenecks, and describe efficient techniques for achieving high performance. We also report preliminary experimental results based on a crawl of 120 million pages on 5 million hosts.
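
To make the kind of system the abstract describes more concrete, the following is a minimal sketch, not the paper's implementation, of a crawler loop built around a shared URL frontier served by multiple downloader threads, with robots.txt checks and per-host politeness delays. All names here (FRONTIER, NUM_WORKERS, CRAWL_DELAY, polite_wait, the seed URL) are illustrative assumptions, and the Python standard library stands in for the custom, distributed components the paper actually describes.

```python
# Illustrative sketch only: a single-process, multi-threaded crawler loop.
# The paper's system distributes this work across a network of workstations.
import queue
import threading
import time
import urllib.request
import urllib.robotparser
from urllib.parse import urlparse

FRONTIER = queue.Queue()   # shared URL frontier (hypothetical name)
robots_cache = {}          # host -> parsed robots.txt rules
last_access = {}           # host -> time the next fetch to that host may start
access_lock = threading.Lock()

NUM_WORKERS = 8            # a real high-performance crawler uses far more connections
CRAWL_DELAY = 1.0          # minimum seconds between requests to the same host

def allowed(url):
    """Check robots.txt for the URL's host, caching the parsed rules per host."""
    host = urlparse(url).netloc
    rp = robots_cache.get(host)
    if rp is None:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"http://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            pass  # treat an unreachable robots.txt as permissive in this sketch
        robots_cache[host] = rp
    return rp.can_fetch("*", url)

def polite_wait(host):
    """Reserve a fetch slot so the same host is not hit more often than CRAWL_DELAY."""
    with access_lock:
        now = time.time()
        delay = max(0.0, CRAWL_DELAY - (now - last_access.get(host, 0.0)))
        last_access[host] = now + delay
    time.sleep(delay)

def worker():
    while True:
        url = FRONTIER.get()
        host = urlparse(url).netloc
        if allowed(url):
            polite_wait(host)
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    page = resp.read()
                    # A full crawler would parse links from `page` here and push
                    # previously unseen URLs back onto FRONTIER.
            except OSError:
                pass  # skip unreachable or failing pages
        FRONTIER.task_done()

for _ in range(NUM_WORKERS):
    threading.Thread(target=worker, daemon=True).start()

FRONTIER.put("http://example.com/")  # hypothetical seed URL
FRONTIER.join()
```

The sketch only shows the frontier/politeness structure; the crawler described in the paper separates crawl management, DNS resolution, and downloading across machines, which is how it reaches several hundred pages per second.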
