PeerCrawl: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web

Most current web crawlers follow a centralized client-server model: the crawl itself may be carried out by one or more tightly coupled machines, but the distribution of crawling jobs and the collection of crawled results are managed by a central coordinator backed by a centralized URL repository. Such centralized designs suffer from well-known problems, including link congestion at the coordinator, a single point of failure, and costly administration.
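A common way for decentralized crawlers to eliminate the centralized URL repository is to deterministically map each URL to a responsible peer, for example by hashing the host name, so that every peer can compute the assignment locally without consulting a coordinator. The sketch below illustrates this general technique only; the function and peer names (peer_for_url, peer-a, and so on) are illustrative assumptions, not PeerCrawl's actual API or its exact distribution mechanism.

```python
import hashlib
from urllib.parse import urlparse

def peer_for_url(url: str, peer_ids: list[str]) -> str:
    """Map a URL's host to one peer by hashing.

    Every peer runs the same function, so URL ownership is
    decided locally with no central repository or coordinator.
    (Hypothetical helper for illustration.)
    """
    host = urlparse(url).netloc
    digest = hashlib.sha1(host.encode("utf-8")).hexdigest()
    return peer_ids[int(digest, 16) % len(peer_ids)]

# Usage: each peer crawls only the URLs that hash to itself and
# forwards the rest, partitioning the frontier across the network.
peers = ["peer-a", "peer-b", "peer-c"]
print(peer_for_url("http://example.com/page.html", peers))
```

Hashing the host rather than the full URL keeps all pages of one site at a single peer, which makes per-site politeness limits easy to enforce locally.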
