I/O-efficient techniques for computing PageRank

Over the last few years, most major search engines have integrated link-based ranking techniques to provide more accurate search results. One widely known approach is the PageRank technique, which forms the basis of the Google ranking scheme and assigns a global importance measure to each page based on the importance of the pages that point to it. The main advantage of the PageRank measure is that it is independent of the query posed by a user; it can therefore be precomputed and then used to optimize the layout of the inverted index structure accordingly. However, computing the PageRank measure requires implementing an iterative process on a massive graph corresponding to billions of web pages and hyperlinks. In this paper, we study I/O-efficient techniques to perform this iterative computation. We derive two algorithms for PageRank based on techniques proposed for out-of-core graph algorithms, and compare them to two existing algorithms proposed by Haveliwala. We also consider the implementation of a recently proposed topic-sensitive version of PageRank. Our experimental results show that for very large data sets, significant improvements over previous results can be achieved on machines with moderate amounts of memory. On the other hand, at most minor improvements are possible on data sets that are only moderately larger than memory, which is the case in many practical scenarios.
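
For readers unfamiliar with the iterative process mentioned above, the following is a minimal in-memory sketch of the standard PageRank power iteration. It is not one of the I/O-efficient algorithms studied in the paper. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without out-links are all assumptions made for illustration. The inner loop reads the edge list sequentially once per iteration, which is the access pattern that becomes expensive when the graph no longer fits in memory.

```python
def pagerank(edges, num_pages, damping=0.85, iterations=50):
    """Illustrative power iteration; `edges` is an iterable of (src, dst) page ids.

    The damping value and the handling of pages without out-links are
    assumptions of this sketch, not details taken from the paper.
    """
    edges = list(edges)

    # Count out-links so each page can split its rank among its targets.
    out_degree = [0] * num_pages
    for src, _ in edges:
        out_degree[src] += 1

    # Start from a uniform distribution over all pages.
    rank = [1.0 / num_pages] * num_pages

    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly (assumption).
        dangling = sum(rank[p] for p in range(num_pages) if out_degree[p] == 0)
        base = (1.0 - damping) / num_pages + damping * dangling / num_pages
        new_rank = [base] * num_pages

        # One sequential pass over the links: each page passes an equal
        # share of its current rank to every page it points to.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]

        rank = new_rank

    return rank


if __name__ == "__main__":
    # Tiny example graph: pages 0 and 1 both link to page 2, page 2 links to page 0.
    print(pagerank([(0, 2), (1, 2), (2, 0)], num_pages=3))
```

In this toy graph, page 2 receives links from two pages and ends up with the highest score, while page 1, which has no incoming links, ends up with the lowest.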
