Distributed Web Crawling over DHTs

In this paper, we present the design and implementation of a distributed web crawler. We begin by motivating the need for such a crawler as a basic building block for decentralized web search applications. The crawler harnesses the excess bandwidth and computing resources of clients to crawl the web. Nodes participating in the crawl use a Distributed Hash Table (DHT) to coordinate and distribute work. We study different crawl distribution strategies and investigate the trade-offs among communication overhead, crawl throughput, load balance on both the crawlers and the crawl targets, and the ability to exploit network proximity. We present an implementation of the distributed crawler using PIER, a relational query processor that runs over the Bamboo DHT, and compare different crawl strategies on PlanetLab querying
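
To make the idea of distributing crawl work over a DHT concrete, the following is a minimal sketch, not the implementation described in the paper. It assumes a consistent-hashing ring in which each key is owned by its successor node, and it contrasts two illustrative partitioning choices: hashing the full URL versus hashing only the hostname. The names `Ring`, `key_id`, `assign_by_url`, and `assign_by_host`, and the node labels, are hypothetical and do not come from PIER or Bamboo.

```python
import hashlib
from bisect import bisect_right
from urllib.parse import urlsplit


def key_id(text: str) -> int:
    """Map a string to a 160-bit identifier using SHA-1 (an assumed choice)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")


class Ring:
    """Toy consistent-hashing ring: a key is owned by its successor node."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("ring needs at least one node")
        # Sort nodes by their position on the identifier circle.
        self._points = sorted((key_id(n), n) for n in nodes)
        self._ids = [point for point, _ in self._points]

    def owner(self, key: str) -> str:
        # First node whose id is greater than the key's id, wrapping around.
        i = bisect_right(self._ids, key_id(key)) % len(self._ids)
        return self._points[i][1]


def assign_by_url(ring: Ring, url: str) -> str:
    """Partition by full URL: pages of one site may land on different nodes."""
    return ring.owner(url)


def assign_by_host(ring: Ring, url: str) -> str:
    """Partition by hostname: all pages of one site go to a single node."""
    host = urlsplit(url).hostname or ""
    return ring.owner(host)


if __name__ == "__main__":
    ring = Ring(["node-a", "node-b", "node-c"])  # hypothetical crawler nodes
    for url in ["http://example.org/a", "http://example.org/b"]:
        print(url, assign_by_url(ring, url), assign_by_host(ring, url))
```

Under these assumptions, hostname partitioning keeps every request to a given site on one crawler, which makes per-site rate limiting a local decision, while URL partitioning tends to spread a large site across many crawlers at the cost of coordinating how hard that site is hit.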
