Paper Information - A Novel Architecture for Domain Specific Parallel Crawler

The World Wide Web is an interlinked collection of billions of documents formatted using HTML. Because the web is growing and dynamic, traversing and handling all the URLs found in web documents has become a challenge, which makes it imperative to parallelize the crawling process. The crawling process is parallelized in the form of an ecology of crawler workers that download information from the web in parallel. This paper proposes a novel parallel crawler architecture based on domain-specific crawling. The architecture makes the crawling task more effective and scalable, and it shares the load among the different crawlers, which download in parallel the web pages related to URLs of different specific domains.
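The idea of sharing load among domain-specific crawler workers can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that "domain" means the top-level domain of a URL's hostname (for example `edu` or `com`), and the names `domain_of`, `partition_by_domain`, `crawl_domain`, `parallel_crawl`, and the stand-in downloader `fake_fetch` are all hypothetical. No network access is performed.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def domain_of(url):
    """Return the top-level domain of a URL, e.g. 'edu' (assumed partition key)."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower() if "." in host else "unknown"


def partition_by_domain(urls):
    """Group URLs into one queue per domain."""
    queues = defaultdict(list)
    for url in urls:
        queues[domain_of(url)].append(url)
    return dict(queues)


def fake_fetch(url):
    """Hypothetical stand-in for an HTTP download; returns a placeholder page."""
    return f"<html>content of {url}</html>"


def crawl_domain(domain, urls, fetch=fake_fetch):
    """One crawler worker: downloads every URL in its own domain queue."""
    return domain, {url: fetch(url) for url in urls}


def parallel_crawl(urls, fetch=fake_fetch):
    """Run one worker per domain queue in parallel and collect the results."""
    queues = partition_by_domain(urls)
    with ThreadPoolExecutor(max_workers=max(1, len(queues))) as pool:
        futures = [pool.submit(crawl_domain, d, q, fetch) for d, q in queues.items()]
        return dict(f.result() for f in futures)
```

Calling `parallel_crawl` with a list of seed URLs returns a dictionary keyed by domain, where each value maps the URLs of that domain to their downloaded content. Because each worker owns exactly one domain queue, no two workers download the same URL, which is one simple way to realize the load sharing the abstract describes.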

Nidhi Tyagi | Deepti Gupta
