Efficient and decentralized PageRank approximation in a peer-to-peer web search network

PageRank-style (PR) link analyses are a cornerstone of Web search engines and Web mining, but they are computationally expensive. Recently, various techniques have been proposed for speeding up these analyses by distributing the link graph among multiple sites. However, none of these methods is suitable for a fully decentralized PR computation in a peer-to-peer (P2P) network with autonomous peers, where each peer can independently crawl Web fragments according to its user's thematic interests. In such a setting, the graph fragments that different peers hold locally or know about may overlap arbitrarily, which adds complexity to the PR computation. This paper presents the JXP algorithm for dynamically and collaboratively computing PR scores of Web pages that are arbitrarily distributed in a P2P network. The algorithm runs at every peer and works by combining locally computed PR scores with random meetings among the peers in the network. It scales as the number of peers in the network grows, and experiments as well as theoretical arguments show that JXP scores converge to the true PR scores that one would obtain by a centralized computation.
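
For orientation, the centralized PR computation that serves as the reference point above can be sketched as a plain power iteration over the full link graph. This is a minimal illustration and not the JXP algorithm itself: the function name `pagerank`, the damping factor of 0.85, the uniform handling of pages without out-links, and the four-page toy graph are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration over a dict {page: [pages it links to]}.

    Hypothetical helper for illustration; parameter defaults are assumptions.
    """
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    score = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        # Score held by pages without out-links is spread uniformly.
        dangling = sum(score[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new = {p: base for p in pages}

        # Each page passes an equal share of its score along every out-link.
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                new[q] += damping * score[p] / len(outs)

        delta = sum(abs(new[p] - score[p]) for p in pages)
        score = new
        if delta < tol:
            break
    return score


# Toy graph (assumed for the example): page "d" has no in-links.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In a P2P setting no single peer holds `graph` in full, which is why a computation of this kind cannot simply be run at one site; each peer sees only its own, possibly overlapping, fragment.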
