Parallel crawlers

In this paper we study how to design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize the crawling process in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and identify fundamental issues in parallel crawling. Based on this understanding, we then propose metrics to evaluate a parallel crawler and compare the proposed architectures using 40 million pages collected from the Web. Our results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.
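To make the architectural question concrete, the sketch below shows one common way a crawl can be split across parallel processes: hash each URL's host name so that all pages of a site are fetched by the same process, and exchange the remaining "non-local" links between processes. This is an illustrative sketch only, not the paper's implementation; the process count, function names, and the link-routing step are assumptions introduced here for the example.

```python
# Illustrative sketch (assumption, not the paper's code): partition URLs
# across parallel crawler processes by hashing the host name, so that a
# site's pages stay with one process and most discovered links are local.
import hashlib
from urllib.parse import urlparse

NUM_PROCESSES = 4  # assumed number of crawling processes


def assign_process(url: str) -> int:
    """Map a URL to a crawler process index by hashing its host name."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PROCESSES


def route_links(discovered_urls, my_id):
    """Split newly discovered URLs into those this process keeps and
    those that would have to be sent to other processes."""
    local, remote = [], []
    for url in discovered_urls:
        (local if assign_process(url) == my_id else remote).append(url)
    return local, remote


if __name__ == "__main__":
    urls = [
        "http://example.com/a.html",
        "http://example.com/b.html",
        "http://example.org/index.html",
    ]
    my_id = assign_process("http://example.com/")
    print(route_links(urls, my_id))
```

How the remote list is handled (forwarded immediately, batched, or simply dropped behind a "firewall") is exactly the kind of design choice the proposed architectures differ on.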