Paper information - A predication-based approach for effective resource discovery in topical web

Due to the enormous growth of the World Wide Web in recent years, quickly crawling specific topical portions without having to explore every Web page has become a new challenge for resource discovery. One recent idea is to predict a URL's degree of relevance to the topic from related properties of the URL, and then to crawl the URLs with a high probability of being relevant. In this paper, we study topical resources further and introduce some new properties that help predict relevance more effectively. We also improve the evaluation algorithm and add two rules that adjust the weights of the factors dynamically, which leads to better prediction precision. Together, these changes improve system performance through a higher topic harvest rate and lower sensitivity to the choice of initial seed URLs.
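The general idea in the abstract, scoring each URL from its properties and crawling the highest-scoring URLs first, can be sketched as a best-first frontier. This is only a minimal illustration under stated assumptions, not the authors' algorithm: the property names (`anchor_text`, `parent_page`, `url_tokens`), the starting weights, and the helpers `predict_relevance` and `Frontier` are all hypothetical, and the weights stay fixed because the abstract does not describe the two adjustment rules.

```python
import heapq

# Hypothetical property names and weights; the abstract does not list them.
weights = {"anchor_text": 0.5, "parent_page": 0.3, "url_tokens": 0.2}


def predict_relevance(features, weights):
    """Weighted sum of per-property scores, each assumed to lie in [0, 1]."""
    return sum(weights[name] * features.get(name, 0.0) for name in weights)


class Frontier:
    """Best-first queue: the URL with the highest predicted score pops first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        # Ignore URLs that were already queued once.
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated score.
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


# Usage: queue two candidate links and fetch the more promising one first.
frontier = Frontier()
frontier.push("http://example.org/a",
              predict_relevance({"anchor_text": 1.0}, weights))
frontier.push("http://example.org/b",
              predict_relevance({"anchor_text": 1.0, "parent_page": 1.0}, weights))
next_url, next_score = frontier.pop()
```

A priority queue keeps the crawl focused on likely on-topic pages without scoring the whole Web up front. A dynamic weighting scheme such as the one the abstract mentions would update `weights` between pushes.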

Jun Wang | Lianhong Cai | Liang Ma | Qunxiu Chen | Guowei Xu
