STRank: A SiteRank algorithm using semantic relevance and time frequency

Most research on web information processing concentrates on web pages and the hyperlinks among them, ignoring the important fact that a web page is just one building block of a whole website. This situation has gradually changed in recent years, driven by needs such as website reputation calculation and high-level website structure mining. As a result, website ranking has become a hot research topic, and various site ranking algorithms, such as SiteRank and AggregateRank, have been proposed. However, most existing website ranking algorithms use only website link graphs and usually do not take the content of websites into consideration, which is not enough for a reliable ranking of websites. To address this issue, this paper introduces two content-based features, semantic relevance and time frequency, and proposes a new algorithm, STRank, based on them. We first conduct a series of experiments to verify the feasibility of these two features in the site ranking task. Semantic relevance is then applied in the calculation of transition probabilities, and the update frequency of sites is incorporated into the ranking. Since the traditional Kendall's τ distance and Spearman's footrule distance are not appropriate for evaluating site rankings, we modify them accordingly to evaluate website ranking algorithms. Finally, our experiments show that STRank outperforms existing approaches in both effectiveness and efficiency.
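For readers unfamiliar with the two evaluation measures named above, the following is a minimal sketch of the standard (unmodified) Kendall's τ distance and Spearman's footrule distance between two full rankings of the same items. It does not reproduce the modified versions used in this paper, which the abstract does not specify; the function names and the site labels are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def kendall_tau_distance(rank_a, rank_b):
    """Standard Kendall's tau distance: the number of item pairs
    that the two rankings place in opposite relative order.
    Assumes both lists contain exactly the same items, each once."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    return sum(
        1
        for x, y in combinations(rank_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def spearman_footrule_distance(rank_a, rank_b):
    """Standard Spearman's footrule distance: the sum over all items
    of the absolute difference between their positions in the two rankings.
    Assumes both lists contain exactly the same items, each once."""
    pos_b = {item: i for i, item in enumerate(rank_b)}
    return sum(abs(i - pos_b[item]) for i, item in enumerate(rank_a))


# Hypothetical site labels, for illustration only.
reference = ["site-a", "site-b", "site-c", "site-d"]
candidate = ["site-b", "site-a", "site-d", "site-c"]

kt = kendall_tau_distance(reference, candidate)
sf = spearman_footrule_distance(reference, candidate)
```

Both measures are zero when the rankings agree and grow as they diverge; Kendall's τ distance counts pairwise inversions, while the footrule sums positional displacements.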
