Hypergraph-Theoretic Partitioning Models for Parallel Web Crawling

Parallel web crawling is an important technique employed by large-scale search engines for content acquisition. A commonly used inter-processor coordination scheme in parallel crawling systems is the link exchange scheme, in which discovered links are communicated between processors. This scheme can attain the coverage and quality of a serial crawler while avoiding redundant crawling of the same pages by different processors. Its main drawback is the high inter-processor communication overhead. In this work, we propose a hypergraph model that reduces the communication overhead associated with link exchange operations in parallel web crawling systems through intelligent assignment of sites to processors. Our hypergraph model correctly captures and minimizes the number of network messages exchanged between crawlers. We evaluate the performance of our models on four benchmark datasets. Compared to the traditional hash-based assignment approach, our models achieve significant reductions in inter-processor communication overhead.
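The idea described above can be sketched in code. In the standard hypergraph-partitioning formulation (which the abstract builds on), sites become vertices, each site's out-links induce a net spanning the sites it points to, and the communication volume of a K-way site-to-processor assignment is measured by the classic connectivity-1 cut metric. The snippet below is an illustrative toy sketch under these assumptions; the function names, the net construction, and the example data are hypothetical and not taken from the paper itself.

```python
from collections import defaultdict

def build_site_hypergraph(links, site_of):
    """Vertices are sites. For each source site s, a net connects s to every
    site that pages of s link to: the processors owning those target sites
    must receive the links discovered while crawling s."""
    nets = defaultdict(set)
    for src_page, dst_page in links:
        s, t = site_of[src_page], site_of[dst_page]
        nets[s].add(s)  # the net is pinned to its own source site...
        nets[s].add(t)  # ...and to each site it links to
    return {s: frozenset(pins) for s, pins in nets.items()}

def connectivity_cut(nets, part):
    """Connectivity-1 metric: for each net, count (lambda - 1), where lambda
    is the number of distinct processors its pins span. This models one
    message per extra processor a site's out-links must be sent to."""
    cost = 0
    for pins in nets.values():
        spanned = {part[s] for s in pins}
        cost += len(spanned) - 1
    return cost

# Toy web graph: 5 pages on 4 sites, page-level links aggregated to sites.
site_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C", "d1": "D"}
links = [("a1", "b1"), ("a2", "c1"), ("b1", "a1"), ("c1", "d1")]
nets = build_site_hypergraph(links, site_of)

hash_like = {"A": 0, "B": 1, "C": 0, "D": 1}  # scatter sites over 2 processors
clustered = {"A": 0, "B": 0, "C": 1, "D": 1}  # co-locate densely linked sites
print(connectivity_cut(nets, hash_like))   # 3 messages crossed
print(connectivity_cut(nets, clustered))   # 1 message crossed
```

Minimizing this objective with a hypergraph partitioner (subject to load-balance constraints on the parts) is what yields assignments that keep most link exchanges local, in contrast to hash-based assignment, which ignores link locality entirely.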
