Research and Implementation of Improved Real-Time Crawler Modeling

The past decade has witnessed the rapid development of search engines, which have become an indispensable part of everyday life. However, people are no longer satisfied with access to ordinary information and increasingly pay attention to fresh information. This demand poses challenges to traditional search engines, which are more concerned with the relevance and importance of web pages. A search engine comprises three modules: a crawler, an indexer, and a searcher. All three must change to improve a search engine's freshness. This paper investigates the first module, the crawler: we analyze the requirements for a real-time crawler and propose a novel real-time crawler based on more accurate estimation of refresh time. Experimental results demonstrate that the proposed real-time crawler can help a search engine improve its freshness.