Crawler Architecture using Grid Computing

The crawler is one of the main components of a search engine: starting from a seed URL, it fetches web pages to build a repository of pages. Each fetched page is parsed to extract the URLs it contains, and the extracted URLs are stored in a URL queue, from which the crawlers fetch them sequentially. Crawling takes a long time when many pages must be collected, so it has become necessary to exploit unused computing resources in organizations to save cost and time. This paper presents a search engine crawler built on grid computing and implemented with Alchemi. Alchemi, an open source project developed at the University of Melbourne, provides middleware for creating an enterprise grid computing environment. The crawling processes are passed to the Alchemi manager, which distributes them over a number of computers acting as executors. The grid-based crawler is implemented and tested, and the results are analyzed. Compared with a single computer, the grid-based crawler shows higher performance and shorter crawl times.
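The manager/executor crawl loop described above can be illustrated with a minimal Python sketch. It is not the paper's implementation and does not use the Alchemi API: a local thread pool stands in for the Alchemi manager and its executors, and a small in-memory dictionary (`FAKE_WEB`, hypothetical) replaces real HTTP fetches so that the example runs without network access. The names `LinkExtractor`, `fetch_and_parse`, and `crawl` are likewise assumptions made for illustration.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.test/b": '<a href="/">home</a>',
    "http://example.test/c": "no links here",
}


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def fetch_and_parse(url):
    """Work unit for one executor: fetch a page and return its out-links."""
    html = FAKE_WEB.get(url, "")  # assumption: stands in for an HTTP GET
    parser = LinkExtractor(url)
    parser.feed(html)
    return url, parser.links


def crawl(seed, workers=4):
    """Manager loop: hand the queued URLs to the workers, batch by batch."""
    queue = deque([seed])   # the URL queue
    seen = {seed}           # URLs already queued, to avoid refetching
    repository = []         # pages collected so far (URLs only, for brevity)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while queue:
            batch = list(queue)
            queue.clear()
            # pool.map returns results in the same order as the batch
            for url, links in pool.map(fetch_and_parse, batch):
                repository.append(url)
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        queue.append(link)
    return repository
```

Calling `crawl("http://example.test/")` visits the four pages of the toy web exactly once each. In a real grid deployment the work units would be dispatched to remote machines rather than local threads, and the fetch step would perform actual HTTP requests.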
