Paradoxical Effects in PageRank Incremental Computations

Deciding which kind of visiting strategy accumulates high-quality pages more quickly is one of the most often debated issues in the design of web crawlers. This paper proposes a related, and previously overlooked, measure of effectiveness for crawl strategies: whether the graph obtained after a partial visit is in some sense representative of the underlying web graph as far as the computation of PageRank is concerned. More precisely, we are interested in determining how rapidly the computation of PageRank over the visited subgraph yields node orders that agree with the ones computed in the complete graph; orders are compared using Kendall's τ. We describe a number of large-scale experiments that show the following paradoxical effect: visits that gather PageRank more quickly (e.g., highest-quality first) are also those that tend to miscalculate PageRank. Finally, we perform the same kind of experimental analysis on some synthetic random graphs, generated using well-known web-graph models: the results are almost opposite to those obtained on real web graphs.
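The comparison described above can be illustrated with a minimal sketch. It is not the experimental code of the paper: the toy graph, the visited set, the damping factor 0.85, the uniform treatment of dangling nodes, the quadratic-time τ (the tie-aware variant often called τ-b), and all function names are assumptions made only for illustration. The sketch computes PageRank on a full graph and on the subgraph induced by a partial visit, then measures how well the two node orders agree on the visited nodes.

```python
from itertools import combinations
from math import sqrt


def pagerank(succ, alpha=0.85, iters=100):
    """Power iteration on a dict mapping each node to its successor list.

    Assumption: the rank of dangling nodes is redistributed uniformly.
    """
    nodes = list(succ)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        dangling = sum(rank[v] for v in nodes if not succ[v])
        nxt = {v: (1.0 - alpha) / n + alpha * dangling / n for v in nodes}
        for v in nodes:
            for w in succ[v]:
                nxt[w] += alpha * rank[v] / len(succ[v])
        rank = nxt
    return rank


def induced_subgraph(succ, visited):
    """Keep only visited nodes and the arcs between them."""
    return {v: [w for w in succ[v] if w in visited] for v in succ if v in visited}


def kendall_tau(x, y):
    """Tie-aware Kendall's tau (tau-b), computed by brute force in O(n^2)."""
    concordant = discordant = tied_x = tied_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both orders: contributes to neither factor
        if dx == 0:
            tied_x += 1
        elif dy == 0:
            tied_y += 1
        elif (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + tied_x) * (concordant + discordant + tied_y))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical toy graph and partial visit, for illustration only.
graph = {0: [1, 2], 1: [2], 2: [0], 3: [2, 4], 4: [3], 5: [4]}
visited = {0, 1, 2, 3}

full = pagerank(graph)
partial = pagerank(induced_subgraph(graph, visited))

common = sorted(visited)
tau = kendall_tau([full[v] for v in common], [partial[v] for v in common])
print(f"Kendall's tau on visited nodes: {tau:.3f}")
```

A value of τ close to 1 means that the partial visit already orders its nodes as the complete graph does; repeating the measurement as the visited set grows gives the kind of convergence curve the abstract refers to. The brute-force τ is adequate only for small examples; graphs of web scale require an O(n log n) method.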
