Pagerank Computation and Keyword Search on Distributed Systems and P2P Networks

This paper presents a fully distributed computation of Google's pagerank algorithm. The computation solves the matrix equation that defines pageranks using a distributed implementation of asynchronous iteration. Pageranks for the documents stored on a web server or on a host in a peer-to-peer network are computed in place and stored with the documents. The matrix is never assembled, and no crawls of the web are required. Pageranks stay continuously accurate because they are computed incrementally as documents are inserted onto a network storage host and recomputed incrementally when documents are deleted. The distributed asynchronous iteration algorithm naturally exploits the intrahost and intradomain dominance of document link structure. Three implementations are described: (i) a previously reported simulation, (ii) an implementation of the algorithm in a peer-to-peer computational system, and (iii) an embedding of the computation in web servers. Results are reported for the three implementations on three workloads: two constructed from power-law network models for link distributions and one derived from the Government document database. Computation of a complete set of pageranks converges rapidly: 1% accuracy is reached in 10 or fewer messages per document.
Incremental computation of pageranks after documents are added or deleted also converges rapidly, usually requiring 10 or fewer messages per document. Storing pageranks locally with the documents in a peer-to-peer network dramatically reduces the volume of data that must be transmitted to satisfy keyword searches. The web server implementation shows that web servers can use the distributed algorithm to compute pageranks for the documents they store, potentially enabling effective keyword search over the documents on intranet web servers by using the servers' otherwise unused processing power.
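
The asynchronous iteration summarized above can be pictured with a minimal single-process Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `async_pagerank`, the damping factor of 0.85, the un-normalized form of the pagerank equation, PR(p) = (1 - d) + d * sum of PR(q)/outdegree(q) over documents q that link to p, and the rule of resending only when a rank has moved by more than a relative threshold `eps` since it was last sent are all choices made here for illustration. Each document keeps the latest contribution it has heard from every in-linking document, messages are delivered in arbitrary order, and no global matrix is ever built.

```python
import random


def async_pagerank(links, damping=0.85, eps=0.01, seed=0):
    """Simulate message-driven, asynchronous pagerank updates.

    links: dict mapping each document to the list of documents it links to.
    Returns (rank, sent): the final ranks and the number of messages sent.
    All names and parameters are hypothetical, chosen for this sketch.
    """
    rng = random.Random(seed)
    docs = sorted(set(links) | {t for targets in links.values() for t in targets})
    out = {d: list(links.get(d, [])) for d in docs}

    # With no contributions heard yet, every rank equals the base term.
    rank = {d: 1.0 - damping for d in docs}
    heard = {d: {} for d in docs}      # latest contribution per in-linker
    last_sent = {d: 0.0 for d in docs}  # rank value most recently sent
    pending = []                        # in-flight (sender, receiver, value)
    sent = 0

    def broadcast(d):
        """Send d's current rank share to every document it links to."""
        nonlocal sent
        for target in out[d]:
            pending.append((d, target, rank[d] / len(out[d])))
            sent += 1
        last_sent[d] = rank[d]

    for d in docs:
        broadcast(d)

    while pending:
        # Deliver a randomly chosen in-flight message to mimic asynchrony.
        i = rng.randrange(len(pending))
        pending[i], pending[-1] = pending[-1], pending[i]
        src, dst, value = pending.pop()

        heard[dst][src] = value
        rank[dst] = (1.0 - damping) + damping * sum(heard[dst].values())

        # Resend only if the rank moved enough since the last send.
        if abs(rank[dst] - last_sent[dst]) > eps * last_sent[dst]:
            broadcast(dst)

    return rank, sent


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    ranks, messages = async_pagerank(graph, eps=0.01)
    print(ranks, messages)
```

In this sketch, `eps` plays the role of a local stopping criterion: a document goes quiet once its own rank has stabilized, so the computation ends when no messages remain in flight. A smaller `eps` yields more accurate ranks at the cost of more messages.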

[1]  Krishna Bharat,et al.  Who links to whom: mining linkage between Web sites , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[2]  James C. Browne,et al.  CoorSet: A Development Environment for Associatively Coordinated Components , 2004, COORDINATION.

[3]  Serge Abiteboul,et al.  Adaptive on-line page importance computation , 2003, WWW '03.

[4]  Antony I. T. Rowstron,et al.  Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems , 2001, Middleware.

[5]  Li Fan,et al.  Web caching and Zipf-like distributions: evidence and implications , 1999, IEEE INFOCOM '99. Conference on Computer Communications. Proceedings. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies. The Future is Now (Cat. No.99CH36320).

[6]  Burton H. Bloom,et al.  Space/time trade-offs in hash coding with allowable errors , 1970, CACM.

[7]  Wayne Kelly,et al.  G 2 Remoting : A Cycle Stealing Framework based on . NET Remoting , 2003 .

[8]  Cleve Moler,et al.  The World' s Largest Matrix Computation Google's PageRank is an eigenvector of a matrix of order 2.7 billion , 2002 .

[9]  Mark Handley,et al.  A scalable content-addressable network , 2001, SIGCOMM 2001.

[10]  Dennis Gannon,et al.  XCAT 2 . 0 : A Component-Based Programming Model for Grid Web Services , 2002 .

[11]  Omprakash D. Gnawali A Keyword-Set Search System for Peer-to-Peer Networks , 2002 .

[12]  Amr Z. Kronfol FASD: A Fault-tolerant, Adaptive, Scalable, Distributed Search Engine , 2002 .

[13]  Ian Clarke,et al.  Protecting Free Expression Online with Freenet , 2002, IEEE Internet Comput..

[14]  Peter Druschel,et al.  Pastry: Scalable, distributed object location and routing for large-scale peer-to- , 2001 .

[15]  Gene H. Golub,et al.  Extrapolation methods for accelerating PageRank computations , 2003, WWW '03.

[16]  Jasmine Novak,et al.  PageRank Computation and the Structure of the Web: Experiments and Algorithms , 2002 .

[17]  Jennifer Widom,et al.  Scaling personalized web search , 2003, WWW '03.

[18]  Greg Ruetsch,et al.  Framework for Peer-to-Peer Distributed Computing in a Heterogeneous, Decentralized Environment , 2002, GRID.

[19]  Taher H. Haveliwala Efficient Computation of PageRank , 1999 .

[20]  Ian J. Taylor,et al.  Distributed P2P computing within Triana: a galaxy visualization test case , 2003, Proceedings International Parallel and Distributed Processing Symposium.

[21]  J. Strikwerda A probabilistic analysis of asynchronous iteration , 2002 .

[22]  David R. Karger,et al.  Chord: A scalable peer-to-peer lookup service for internet applications , 2001, SIGCOMM '01.

[23]  Taher H. Haveliwala,et al.  The Second Eigenvalue of the Google Matrix , 2003 .

[24]  Taher H. Haveliwala,et al.  Adaptive methods for the computation of PageRank , 2004 .

[25]  Robert Morris,et al.  Chord: A scalable peer-to-peer lookup service for internet applications , 2001, SIGCOMM 2001.

[26]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[27]  Steven Newhouse,et al.  Autonomic service adaptation in ICENI using ontological annotation , 2003, Proceedings. First Latin American Web Congress.

[28]  Gene H. Golub,et al.  Exploiting the Block Structure of the Web for Computing , 2003 .

[29]  D. Szyld,et al.  On asynchronous iterations , 2000 .

[30]  L. Eldén A Note on the Eigenvalues of the Google Matrix , 2004, math/0401177.

[31]  Amin Vahdat,et al.  Efficient Peer-to-Peer Keyword Searching , 2003, Middleware.

[32]  Hector Garcia-Molina,et al.  The Eigentrust algorithm for reputation management in P2P networks , 2003, WWW '03.

[33]  Yuan Shi,et al.  Timing Models and Local Stopping Criteria for Asynchronous Iterative Algorithms , 1999, J. Parallel Distributed Comput..

[34]  James C. Browne,et al.  An Associative Broadcast Based Coordination Model for Distributed Processes , 2002, COORDINATION.

[35]  James C. Browne,et al.  Distributed pagerank for P2P systems , 2003, High Performance Distributed Computing, 2003. Proceedings. 12th IEEE International Symposium on.