Paper Information - A New Algorithm of Topical Crawler

A New Algorithm of Topical Crawler

Generic crawlers help people find information on the WWW. However, because they are general-purpose rather than specialized, they fall short in precision and efficiency. In this paper, we address two issues for the topical web crawler: how to define the topic, and how to efficiently order the links waiting in the download queue. The crawler aims to visit only relevant pages and to collect a large number of hyperlinks that point to relevant pages. The crawl method proposed in this paper is novel and is based on the semi-structured features of websites and on content information. Experimental results show that it is a very effective method for focused crawling.
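The link-ordering problem mentioned in the abstract can be illustrated with a generic best-first frontier. The sketch below is not the paper's algorithm, which the abstract does not specify. It assumes a simple keyword-overlap score, and the names `relevance`, `Frontier`, the topic terms, and the example URLs are all hypothetical.

```python
import heapq
import itertools


def relevance(text, topic_terms):
    """Fraction of topic terms that occur in the text (an assumed stand-in score)."""
    words = set(text.lower().split())
    if not topic_terms:
        return 0.0
    return sum(1 for term in topic_terms if term in words) / len(topic_terms)


class Frontier:
    """Priority queue of URLs; the highest-scoring unseen URL is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker: earlier pushes win ties

    def push(self, url, score):
        """Queue a URL once; return False if it was already queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))
        return True

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


# Hypothetical topic definition and link anchor texts.
topic = {"crawler", "web", "topic"}
links = [
    ("http://example.com/a", "focused web crawler survey"),
    ("http://example.com/b", "cooking recipes"),
    ("http://example.com/c", "topic driven web crawler"),
]

frontier = Frontier()
for url, anchor_text in links:
    frontier.push(url, relevance(anchor_text, topic))

# Download order: most relevant link first.
order = [frontier.pop()[0] for _ in range(len(frontier))]
```

In a real crawler the score would come from whatever topic definition is in use, and it could draw on page structure as well as text, as the abstract suggests.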

Zhao Tie-jun | Li Wei-jiang | Zang Wen-mao | Ru Hua-suo
