Parallel crawlers

In this paper we study how to design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize the crawling process in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and identify fundamental issues in parallel crawling. Based on this understanding, we then propose metrics to evaluate a parallel crawler and compare the proposed architectures using 40 million pages collected from the Web. Our results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.
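To make the architectural question concrete, the sketch below shows one common way a crawl can be split across parallel processes: hash each URL's host name so that all pages of a site are fetched by the same process, and exchange the remaining "non-local" links between processes. This is an illustrative sketch only, not the paper's implementation; the process count, function names, and the link-routing step are assumptions introduced here for the example.

```python
# Illustrative sketch (assumption, not the paper's code): partition URLs
# across parallel crawler processes by hashing the host name, so that a
# site's pages stay with one process and most discovered links are local.
import hashlib
from urllib.parse import urlparse

NUM_PROCESSES = 4  # assumed number of crawling processes


def assign_process(url: str) -> int:
    """Map a URL to a crawler process index by hashing its host name."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PROCESSES


def route_links(discovered_urls, my_id):
    """Split newly discovered URLs into those this process keeps and
    those that would have to be sent to other processes."""
    local, remote = [], []
    for url in discovered_urls:
        (local if assign_process(url) == my_id else remote).append(url)
    return local, remote


if __name__ == "__main__":
    urls = [
        "http://example.com/a.html",
        "http://example.com/b.html",
        "http://example.org/index.html",
    ]
    my_id = assign_process("http://example.com/")
    print(route_links(urls, my_id))
```

How the remote list is handled (forwarded immediately, batched, or simply dropped behind a "firewall") is exactly the kind of design choice the proposed architectures differ on.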