PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs

Large-scale graph-structured computation is central to tasks ranging from targeted advertising to natural language processing and has led to the development of several graph-parallel abstractions, including Pregel and GraphLab. However, the natural graphs commonly found in the real world have highly skewed power-law degree distributions, which challenge the assumptions made by these abstractions, limiting performance and scalability. In this paper, we characterize the challenges of computation on natural graphs in the context of existing graph-parallel abstractions. We then introduce the PowerGraph abstraction, which exploits the internal structure of graph programs to address these challenges. Leveraging the PowerGraph abstraction, we introduce a new approach to distributed graph placement and representation that exploits the structure of power-law graphs. We provide a detailed analysis and experimental evaluation comparing PowerGraph to two popular graph-parallel systems. Finally, we describe three different implementation strategies for PowerGraph and discuss their relative merits, with empirical evaluations on large-scale real-world problems demonstrating order-of-magnitude gains.
