Web Crawler

Our project consists of designing and implementing an efficient general-purpose web crawler. A web crawler is an automated program that accesses a web site and systematically traverses it by following the links present on its pages. The main purpose of a web crawler is to feed a database with information from the web for later processing by a search engine; this purpose is the focus of our project.

Most related work in this area is associated with popular search engines, whose crawling algorithms and detailed architectures are kept as business secrets. However, the authors of crawlers such as RBSE, WebCrawler, the World Wide Web Worm, the Internet Archive crawler, an early version of the Google crawler, Mercator, Salticus, WebFountain and WIRE have published descriptions of their architecture. Besides structural issues, research on web crawling has focused on parallelism, discovery and control of crawlers by web site administrators, accessing content behind forms (the "hidden" web), detecting mirrors, keeping the search engine's copy fresh, long-term scheduling, and focused crawling. In addition, there have been studies on characteristics of the web that directly affect the performance of a crawler, such as detecting communities, characterizing server response time, studying the distribution of web page changes, and proposing protocols for web servers to cooperate with crawlers [Castillo2004].

Figure 1 shows the architecture of the JoMaGic web crawler. This section describes the purpose and implementation of each of the components shown.
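To make the traversal described above concrete, the following is a minimal sketch of a basic crawl loop using a breadth-first frontier. It is purely illustrative and does not correspond to the JoMaGic components described later; the seed URL, the page limit, and the helper names are assumptions made for the example.

```python
import urllib.request
import urllib.parse
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attributes of <a> tags found on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=50):
    """Breadth-first traversal: fetch a page, extract its links,
    enqueue unseen URLs, and stop after max_pages successful fetches."""
    frontier = deque([seed_url])   # URLs waiting to be fetched
    visited = set()                # URLs already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip unreachable or malformed pages
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            # Resolve relative links against the current page's URL.
            absolute = urllib.parse.urljoin(url, link)
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)
    return visited

if __name__ == "__main__":
    for page in crawl("http://example.com"):
        print(page)
```

A production crawler adds, on top of this loop, the concerns surveyed above: politeness and robots.txt handling, parallel fetching, URL ordering, and freshness management, which the components in Figure 1 address.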
