Web Crawling: Important Factor for Web Search Engines in Information Retrieval

Web search currently generates more than 13% of the traffic to Web sites. The main problem search engines must deal with is the size of the Web, which is currently on the order of billions of pages. This large size results in low coverage: no search engine indexes more than about one third of the publicly available Web. The Web's large size and dynamic nature highlight the need for continuous support and updating of Web-based information retrieval systems. Crawlers facilitate this process by following the hyperlinks in Web pages to automatically download a partial snapshot of the Web. While some systems rely on crawlers that exhaustively crawl the Web, others incorporate “focus” within their crawlers to harvest application- or topic-specific collections. We discuss the basic issues involved in developing a crawling infrastructure. This is followed by a review of several topical crawling algorithms and of evaluation metrics that may be used to judge their performance. While many innovative applications of Web crawling are still being invented, we take a brief look at some developed in the past.
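The link-following behavior described above can be sketched as a simple breadth-first frontier loop. This is a minimal illustration, not any particular system's implementation: an in-memory dictionary (a hypothetical toy "web" mapping each page to the links it contains) stands in for real HTTP fetching and HTML link extraction.

```python
from collections import deque

# Toy stand-in for the Web: each "URL" maps to the links on that page.
# (Assumed illustrative data, not part of any real crawl.)
WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["a"],
}

def crawl(seed, limit=10):
    """Follow hyperlinks breadth-first from a seed page, collecting a
    partial snapshot (the list of pages visited, in crawl order)."""
    frontier = deque([seed])  # URLs waiting to be "fetched"
    visited = []              # pages already downloaded
    while frontier and len(visited) < limit:
        url = frontier.popleft()
        if url in visited:
            continue          # skip pages seen via multiple links
        visited.append(url)
        for link in WEB.get(url, []):  # "parse" the page for outlinks
            if link not in visited:
                frontier.append(link)
    return visited

print(crawl("a"))  # breadth-first order: ['a', 'b', 'c', 'd']
```

A focused crawler would differ mainly in how it orders the frontier, prioritizing links judged relevant to a topic rather than visiting them first-in, first-out.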