Load Balancing Using Consistent Hashing: A Real Challenge for Large Scale Distributed Web Crawlers

Large-scale search engines today use distributed Web crawlers to collect Web pages, because it is impractical for a single machine to download the entire Web. Load balancing such crawlers is an important task because each crawling machine has limited memory and other resources. Existing distributed crawlers partition the URL space by hashing site (host) names; in a distributed environment this can be implemented with consistent hashing, which dynamically handles the joining and leaving of crawling nodes. This scheme is formally claimed to be load-balanced whenever the hash function is uniform. However, since the Web's structure follows power-law distributions according to existing statistics, we argue that a uniform random hash function over the site's URL cannot be load-balanced in the case of large-scale distributed Web crawlers. We support this claim by applying Web statistics to consistent hashing as used in one well-known Web crawler, and we report experimental results demonstrating the load imbalance that arises when relying solely on the hash of host names.
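The partitioning scheme the abstract critiques, consistent hashing over host names, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any crawler discussed here: the node names, the replica (virtual-node) count, and the Zipf-like per-host page counts in the demo are all hypothetical choices for demonstration.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Minimal consistent-hash ring mapping host names to crawler nodes.

    Assumption: MD5 as the uniform hash and 100 virtual points per node;
    real crawlers differ in hash function and virtual-node counts.
    """

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.keys = []   # sorted hash positions on the ring
        self.ring = []   # (hash, node) pairs, kept parallel to self.keys
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        # Hash a string uniformly into a large integer space.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Each node owns several "virtual" points to smooth its arc size.
        for i in range(self.replicas):
            h = self._hash(f"{node}#{i}")
            idx = bisect.bisect(self.keys, h)
            self.keys.insert(idx, h)
            self.ring.insert(idx, (h, node))

    def get_node(self, host):
        # A host is served by the first node clockwise from its hash,
        # wrapping around the ring at the top of the hash space.
        h = self._hash(host)
        idx = bisect.bisect(self.keys, h) % len(self.keys)
        return self.ring[idx][1]


if __name__ == "__main__":
    # Illustration of the paper's point: HOSTS spread roughly evenly over
    # nodes, but when per-host page counts follow a power law (Zipf-like
    # weights below, a hypothetical workload), the per-node PAGE load is
    # skewed even though the hash itself is uniform.
    from collections import Counter

    nodes = [f"node-{i}" for i in range(4)]
    ring = ConsistentHashRing(nodes)
    hosts = [f"site-{i}.example" for i in range(2000)]
    pages = {h: int(10000 / (i + 1)) + 1 for i, h in enumerate(hosts)}

    host_load = Counter(ring.get_node(h) for h in hosts)
    page_load = Counter()
    for h in hosts:
        page_load[ring.get_node(h)] += pages[h]
    print("hosts per node:", dict(host_load))
    print("pages per node:", dict(page_load))
```

The design choice being isolated here is that consistent hashing balances the number of *keys* (hosts) per node, not the *weight* (pages, bandwidth) behind each key; under a power-law site-size distribution, a few heavy hosts dominate whichever node they land on.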