The focused crawler of a special-purpose search engine aims to selectively seek out pages that are relevant to a pre-defined set of topics, rather than to explore all regions of the Web. The PageRank algorithm is widely used to rank web pages, and it is also used for URL ordering in focused crawlers. It estimates a page's authority from the link structure of the Web. However, it assigns every outlink of a page the same weight and is independent of topic, which leads to topic drift. In this paper, we propose an improved PageRank algorithm, which we call "To-PageRank", and we present a crawling strategy that combines the To-PageRank algorithm with the topic similarity of the hyperlink metadata. Experiments with a focused crawler show that the new crawling strategy performs better than the breadth-first and PageRank strategies.
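To make the limitation concrete, the following is a minimal sketch of standard PageRank computed by power iteration. It is not the To-PageRank algorithm, whose details the abstract does not give. The function name `pagerank`, the `graph` dictionary format, the damping factor of 0.85, and the handling of pages without outlinks are all assumptions made for illustration. The line that computes `share` is where every outlink of a page receives the same weight, regardless of topic.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Standard PageRank by power iteration (illustrative sketch).

    graph maps each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the same baseline from the random jump.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            outlinks = graph.get(page, [])
            if outlinks:
                # Uniform split: each outlink gets an equal share,
                # with no regard for the topic of the target page.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Assumption: a page without outlinks spreads its
                # rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph used only to exercise the function.
example_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
example_rank = pagerank(example_graph)

# A crawler that orders URLs by PageRank would visit higher scores first.
crawl_order = sorted(example_rank, key=example_rank.get, reverse=True)
```

A topic-aware variant would replace the uniform `share` with weights that depend on how relevant each link is to the target topic, which is the direction the abstract describes.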