A Novel Architecture of a Parallel Web Crawler

ABSTRACT
Due to the explosion in the size of the WWW [1, 4, 5], it has become essential to parallelize the crawling process. In this paper we present an architecture for a parallel crawler consisting of multiple crawling processes, called C-procs, which can run on a network of workstations. The proposed crawler is scalable and resilient against system crashes and other failures. The aim of this architecture is to crawl the current set of publicly indexable web pages efficiently and effectively, maximizing the download rate while minimizing the overhead of parallelization.

Keywords: WWW, Search Engines, Crawlers, Parallel Crawlers.

1. INTRODUCTION
The World-Wide Web has undergone explosive, exponential growth. As a consequence, users find themselves unable to browse its ever-changing, distributed hyperlink structure. Furthermore, they are subjected to information overload: information is too abundant. With the increasing number of information resources on the Web, it is often difficult to locate the resources relevant to a given need. Hence, search engines [2, 3] have become an essential tool for locating relevant information.
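To make the idea concrete, the following is a minimal sketch of a parallel crawl in which several workers stand in for the paper's C-procs. Everything here is illustrative: the paper's C-procs run on a network of workstations and fetch real pages over HTTP, whereas this sketch uses worker threads on one machine and a hypothetical in-memory link graph (`LINK_GRAPH`) in place of the web, so that the frontier/dispatch logic can be shown without network access.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory link graph standing in for the web; a real
# C-proc would fetch each URL over HTTP and extract links from the HTML.
LINK_GRAPH = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com", "d.com"],
    "c.com": ["a.com"],
    "d.com": [],
}

def crawl(url):
    """One worker's unit of work: 'download' a page and return its out-links."""
    return url, LINK_GRAPH.get(url, [])

def parallel_crawl(seeds, n_workers=2):
    """Crawl breadth-first, dispatching each frontier to n_workers in parallel."""
    seen, frontier, pages = set(seeds), list(seeds), {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        while frontier:
            # Each round, the whole frontier is crawled concurrently.
            results = list(pool.map(crawl, frontier))
            frontier = []
            for url, links in results:
                pages[url] = links
                for link in links:
                    if link not in seen:   # enqueue only unseen URLs to
                        seen.add(link)     # avoid duplicate downloads
                        frontier.append(link)
    return pages

if __name__ == "__main__":
    print(sorted(parallel_crawl(["a.com"])))  # → ['a.com', 'b.com', 'c.com', 'd.com']
```

The shared `seen` set plays the role of the coordination the paper's architecture must provide across C-procs: without it, two workers could download the same page, which is exactly the parallelization overhead the abstract aims to minimize.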

REFERENCES

[1] Hector Garcia-Molina, et al. Parallel Crawlers. WWW, 2002.

[2] Pawan Kumar, et al. Notice of Violation of IEEE Publication Principles: The Anatomy of a Large-Scale Hypertextual Web Search Engine. 2009.

[3] Luis Gravano, et al. Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection. VLDB, 2002.

[4] B. Huberman, et al. The Deep Web: Surfacing Hidden Value. 2000.

[5] Christopher Olston, et al. What's New on the Web? The Evolution of the Web from a Search Engine Perspective. WWW '04, 2004.

[6] Hector Garcia-Molina, et al. The Evolution of the Web and Implications for an Incremental Crawler. VLDB, 2000.

[7] B. Pinkerton. Finding What People Want: Experiences with the WebCrawler. WWW Spring 1994.

[8] W. Kinsner, et al. Hypertext Markup Language. 1999.

[9] Daniel S. Hirschberg, et al. Parallel Algorithms for the Transitive Closure and the Connected Component Problems. STOC '76, 1976.

[10] Torsten Suel, et al. Design and Implementation of a High-Performance Distributed Web Crawler. Proceedings of the 18th International Conference on Data Engineering, 2002.

[11] Marc Najork, et al. Mercator: A Scalable, Extensible Web Crawler. World Wide Web, 1999.

[12] Mahadev Satyanarayanan, et al. Coda: A Highly Available File System for a Distributed Workstation Environment. IEEE Trans. Computers, 1990.

[13] Valter Crescenzi, et al. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. VLDB, 2001.