UbiCrawler: a scalable fully distributed Web crawler

We report our experience in implementing UbiCrawler, a scalable distributed Web crawler, using the Java programming language. The main features of UbiCrawler are platform independence, linear scalability, graceful degradation in the presence of faults, a very effective assignment function (based on consistent hashing) for partitioning the domain to crawl, and, more generally, the complete decentralization of every task. The need to handle very large data sets highlighted some limitations of the Java APIs, which prompted the authors to partially reimplement them. Copyright © 2004 John Wiley & Sons, Ltd.
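To make the idea of a consistent-hashing assignment function concrete, the following is a minimal sketch in Java, not the authors' implementation. It assumes that hosts are the unit of assignment and uses a generic hash ring with replicated points per agent. The class name `ConsistentAssigner`, the method names, the use of MD5 as the hash, and the replica count are all hypothetical choices made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: assigns host names to crawler agents on a hash ring. */
final class ConsistentAssigner {
    // Ring position -> agent name; each agent owns several positions.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int replicas;

    ConsistentAssigner(int replicas) {
        this.replicas = replicas;
    }

    // Assumption: the first 8 bytes of an MD5 digest serve as the ring position.
    private static long hash(String key) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] d = md.digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Places `replicas` points for the agent on the ring. */
    void addAgent(String agent) {
        for (int i = 0; i < replicas; i++) {
            ring.put(hash(agent + "#" + i), agent);
        }
    }

    /** Removes only the points that belong to the given agent. */
    void removeAgent(String agent) {
        for (int i = 0; i < replicas; i++) {
            ring.remove(hash(agent + "#" + i), agent);
        }
    }

    /** Returns the agent owning the first ring point at or after the host's hash. */
    String agentFor(String host) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no agents registered");
        }
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(host));
        return (e != null ? e : ring.firstEntry()).getValue(); // wrap around the ring
    }
}
```

The property this sketch illustrates is that when an agent leaves, only the hosts it was responsible for are reassigned; every other host keeps its agent, which is what allows a crawler to degrade gracefully when an agent fails.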
