A prediction-based approach for effective resource discovery in topical web

Owing to the enormous growth of the World Wide Web in recent years, crawling a specific topical portion of it quickly, without having to explore every Web page, has become a new challenge for resource discovery. A recent idea is to predict a URL's degree of relevance to the topic from related properties of the URL, and then to crawl the URLs with the highest predicted relevance first. In this paper, we study topical resources further and introduce several new properties that support more effective relevance prediction. We also improve the evaluation algorithm and add two rules that adjust the weights of the factors dynamically, which leads to better prediction precision. These improvements raise system performance through a higher topic harvest rate and lower sensitivity to the choice of initial seed URLs.
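
The general idea of crawling the most promising URLs first can be illustrated with a minimal best-first frontier. This is only a sketch under stated assumptions, not the system described in this paper: the property names (`anchor_text`, `parent_relevance`, `url_tokens`), the weight values, and the weighted-sum scoring are all hypothetical, and the two dynamic weight-adjustment rules are not modelled.

```python
import heapq

# Hypothetical factor weights; the actual properties and their weights
# are not specified in this abstract.
WEIGHTS = {"anchor_text": 0.5, "parent_relevance": 0.3, "url_tokens": 0.2}


def predict_relevance(features, weights=WEIGHTS):
    """Weighted sum of per-URL property scores, each assumed to lie in [0, 1]."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


class Frontier:
    """Best-first crawl frontier: the URL with the highest predicted relevance is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, features):
        # Ignore URLs that have already been queued once.
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-predict_relevance(features), url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


frontier = Frontier()
frontier.push("http://example.org/a", {"anchor_text": 1.0, "parent_relevance": 0.5})
frontier.push("http://example.org/b", {"url_tokens": 1.0})
frontier.push("http://example.org/c", {"anchor_text": 1.0, "parent_relevance": 1.0, "url_tokens": 1.0})

while frontier:
    url, score = frontier.pop()
    print(f"{score:.2f} {url}")
```

In a real crawler the property scores would be computed from the page that links to each URL, and the weights would be tuned or adjusted during the crawl rather than fixed as they are here.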