IglooG: A Distributed Web Crawler Based on Grid Service

Web crawler is program used to download documents from the web site. This paper presents the design of a distributed web crawler on grid platform. This distributed web crawler is based on our previous work Igloo. Each crawler is deployed as grid service to improve the scalability of the system. Information services in our system are in charge of distributing URLs to balance the loads of the crawlers and are deployed as grid service. Information services are organized as Peer-to-Peer overlay network. According to the ID of crawler and semantic vector of crawl page that is computed by Latent Semantic Indexing, crawler can decide whether transmits the URL to information service or hold itself. We present an implementation of the distributed crawler based on Igloo and simulate the environment of Grid to evaluate the balancing load on the crawlers and crawl speed. Both the theoretical analysis and the experimental results show that our system is a high-performance and reliable system.

[1]  Oliver A. McBryan,et al.  GENVL and WWWW: Tools for taming the Web , 1994, WWW Spring 1994.

[2]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[3]  Ian T. Foster,et al.  Grid information services for distributed resource sharing , 2001, Proceedings 10th IEEE International Symposium on High Performance Distributed Computing.

[4]  Marc Najork,et al.  Mercator: A scalable, extensible Web crawler , 1999, World Wide Web.

[5]  Mark Handley,et al.  A scalable content-addressable network , 2001, SIGCOMM 2001.

[6]  David Eichmann,et al.  The RBSE spider — Balancing effective search against Web load , 1994, WWW Spring 1994.

[7]  David R. Karger,et al.  On the Feasibility of Peer-to-Peer Web Indexing and Search , 2003, IPTPS.

[8]  B. Pinkerton,et al.  Finding What People Want : Experiences with the WebCrawler , 1994, WWW Spring 1994.

[9]  Mark Handley,et al.  A scalable content-addressable network , 2001, SIGCOMM '01.

[10]  Elizabeth R. Jessup,et al.  Matrices, Vector Spaces, and Information Retrieval , 1999, SIAM Rev..

[11]  Krishna Bharat,et al.  SPHINX: A Framework for Creating Personal, Site-Specific Web Crawlers , 1998, Comput. Networks.

[12]  Zhichen Xu,et al.  pSearch: information retrieval in structured overlays , 2003, CCRV.

[13]  Monika R. Henzinger Hyperlink Analysis for the Web Hyperlink analysis algorithms allow search engines to deliver focused results to user queries.This article surveys ranking algorithms used to retrieve information on the Web. , 2001 .