Combination of In-Memory Graph Computation with MapReduce: A Subgraph-Centric Method of PageRank

To improve the efficiency of the PageRank algorithm, parallelization methods, especially those based on MapReduce, have attracted considerable research interest over the past several years. Previous implementations of PageRank on MapReduce ignore data locality in distributed systems, which is important for reducing I/O and network costs. In this paper, we exploit the locality property and propose a new method for fast PageRank computation that supports a subgraph as the input record of a map function. Graph partitioning techniques and a message grouping method are employed to keep communication among subgraphs efficient. Experiments show that our method is significantly more efficient than previous approaches, with no loss of accuracy. The key idea, changing the granularity of the basic processing unit from edges to subgraphs, can benefit many other parallel graph processing algorithms.
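The subgraph-as-record idea can be illustrated with a minimal single-process sketch in Python. It is not the implementation evaluated in the paper: the names `partition_of`, `map_subgraph`, and `subgraph_pagerank` are hypothetical, modulo placement stands in for the graph partitioning techniques, the uniform redistribution of rank from nodes without out-links is an assumption, and no actual MapReduce runtime is involved. Each map call receives a whole subgraph, and contributions bound for the same destination partition are summed into one grouped message.

```python
from collections import defaultdict


def partition_of(node, num_parts):
    # Assumption: modulo placement stands in for a real graph partitioner.
    return node % num_parts


def map_subgraph(members, out_edges, ranks, num_parts):
    """Map step over one subgraph: group rank shares by destination partition."""
    grouped = defaultdict(lambda: defaultdict(float))
    dangling = 0.0
    for u in members:
        targets = out_edges.get(u, [])
        if not targets:
            dangling += ranks[u]  # no out-links: redistributed uniformly later
            continue
        share = ranks[u] / len(targets)
        for v in targets:
            # Shares for the same target are summed before "sending".
            grouped[partition_of(v, num_parts)][v] += share
    return grouped, dangling


def subgraph_pagerank(nodes, out_edges, num_parts=2, damping=0.85, iters=50):
    n = len(nodes)
    ranks = {u: 1.0 / n for u in nodes}
    parts = defaultdict(list)
    for u in nodes:
        parts[partition_of(u, num_parts)].append(u)

    for _ in range(iters):
        inbox = defaultdict(lambda: defaultdict(float))
        dangling_total = 0.0
        for members in parts.values():
            grouped, dangling = map_subgraph(members, out_edges, ranks, num_parts)
            dangling_total += dangling
            # One grouped message per destination partition.
            for dest, messages in grouped.items():
                for v, contribution in messages.items():
                    inbox[dest][v] += contribution
        # Reduce step: each partition updates its own nodes from its inbox.
        ranks = {
            u: (1 - damping) / n + damping * (inbox[pid][u] + dangling_total / n)
            for pid, members in parts.items()
            for u in members
        }
    return ranks


# Toy graph: every edge target must also appear in the node list.
edges = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
ranks = subgraph_pagerank([0, 1, 2, 3], edges)
```

In this sketch the number of messages exchanged per iteration depends on how many distinct nodes in other partitions a subgraph points to, rather than on the number of cut edges, which is one plausible reading of why grouping reduces communication. The ranks themselves do not depend on how the nodes are partitioned.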
