Improved Link-Based Algorithms for Ranking Web Pages

Several link-based algorithms, such as PageRank [7], HITS [4] and SALSA [5], have been developed to evaluate the popularity of web pages. These algorithms can be interpreted as computing the steady-state distribution of various Markov processes over web pages. The PageRank and HITS algorithms tend to over-rank tightly interlinked collections of pages, such as well-organized message boards. We show that this effect can be alleviated by modifying the underlying Markov process. Specifically, rather than weighting all outlinks from a given page equally, we give greater weight to links between pages that are, in other respects, farther apart in the web, and less weight to links between pages that are nearby. We have experimented with several variants of this idea, using different measures of "distance" in the web and different weighting schemes. We show that these revised algorithms often avoid the over-ranking problem and give better overall rankings.
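The reweighting idea can be illustrated with a minimal sketch, which is not the authors' implementation. It runs a PageRank-style power iteration in which each outlink carries a weight, and it assumes a hypothetical distance measure: two pages are "nearby" when they share a host. The function names, the 0.2 same-host weight, the 0.85 damping factor, and the toy graph are all assumptions made for illustration.

```python
def weighted_pagerank(links, weight, damping=0.85, iterations=100):
    """PageRank-style power iteration with per-link weights.

    links  : dict mapping each page to a list of pages it links to
    weight : function (source, target) -> nonnegative link weight
    """
    pages = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Teleportation share, spread uniformly over all pages.
        nxt = {p: (1.0 - damping) / n for p in pages}
        for u in pages:
            outs = links.get(u, [])
            total = sum(weight(u, v) for v in outs)
            if total == 0:
                # No usable outlinks: spread this page's rank uniformly.
                for p in pages:
                    nxt[p] += damping * rank[u] / n
            else:
                # Split rank among outlinks in proportion to their weights.
                for v in outs:
                    nxt[v] += damping * rank[u] * weight(u, v) / total
        rank = nxt
    return rank


def host(page):
    """Host part of a 'host/path' page name (naming scheme assumed here)."""
    return page.split("/", 1)[0]


def host_distance_weight(u, v):
    """Assumed distance measure: same-host links count as 'nearby'."""
    return 0.2 if host(u) == host(v) else 1.0


def board_mass(rank):
    """Total rank held by the tightly interlinked message board."""
    return sum(r for p, r in rank.items() if host(p) == "board.example")


# Toy graph: three mutually linked board pages plus two outside pages.
LINKS = {
    "board.example/b1": ["board.example/b2", "board.example/b3", "news.example/a"],
    "board.example/b2": ["board.example/b1", "board.example/b3"],
    "board.example/b3": ["board.example/b1", "board.example/b2"],
    "news.example/a": ["board.example/b1", "wiki.example/x"],
    "wiki.example/x": ["news.example/a"],
}

uniform_rank = weighted_pagerank(LINKS, lambda u, v: 1.0)
distance_rank = weighted_pagerank(LINKS, host_distance_weight)
print("board mass, uniform weights: ", round(board_mass(uniform_rank), 3))
print("board mass, distance weights:", round(board_mass(distance_rank), 3))
```

In this toy graph the only transition probabilities that change are those out of the one board page with an external link, so down-weighting same-host links moves rank away from the board. Other distance measures or weighting schemes can be substituted by passing a different `weight` function.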

[1] Ravi Kumar, et al. Self-similarity in the web, 2001, TOIT.

[2] Jon M. Kleinberg, et al. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text, 1998, Comput. Networks.

[3] Wei Zhang, et al. Improvement of HITS-based algorithms on web documents, 2002, WWW '02.

[4] Matthew Richardson, et al. The Intelligent surfer: Probabilistic Combination of Link and Content Information in PageRank, 2001, NIPS.

[5] Michael I. Jordan, et al. Stable algorithms for link analysis, 2001, SIGIR '01.

[6] Allan Borodin, et al. Finding authorities and hubs from link structures on the World Wide Web, 2001, WWW '01.

[7] Ming-Syan Chen, et al. Entropy-based link analysis for mining web informative structures, 2002, CIKM '02.

[8] S. Hinton, et al. From Home Page to Home Site: Effective Web Resource Discovery at the ANU, 1998, Comput. Networks.

[9] Shlomo Moran, et al. The stochastic approach for link-structure analysis (SALSA) and the TKC effect, 2000, Comput. Networks.

[10] Andrei Z. Broder, et al. Graph structure in the Web, 2000, Comput. Networks.

[11] Albert-László Barabási, et al. Internet: Diameter of the World-Wide Web, 1999, Nature.

[12] Moni Naor, et al. Rank aggregation methods for the Web, 2001, WWW '01.

[13] Marco Gori, et al. Web page scoring systems for horizontal and vertical search, 2002, WWW.

[14] Chris H. Q. Ding, et al. PageRank, HITS and a unified framework for link analysis, 2002, SIGIR '02.

[15] David Cohn, et al. Learning to Probabilistically Identify Authoritative Documents, 2000, ICML.

[16] George Karypis, et al. A Comparison of Document Clustering Techniques, 2000.

[17] Taher H. Haveliwala. Topic-sensitive PageRank, 2002, IEEE Trans. Knowl. Data Eng.

[18] Rajeev Motwani, et al. The PageRank Citation Ranking: Bringing Order to the Web, 1999, WWW 1999.

[19] Albert, et al. Emergence of scaling in random networks, 1999, Science.

[20] Krishna Bharat, et al. Improved algorithms for topic distillation in a hyperlinked environment, 1998, SIGIR '98.