To Push or To Pull: On Reducing Communication and Synchronization in Graph Computations

We reduce the cost of communication and synchronization in graph processing by analyzing which of two ways to process graphs is faster: pushing updates to a shared state or pulling updates to a private state. We investigate the applicability of this push-pull dichotomy to various algorithms and its impact on complexity, performance, and the number of locks, atomic operations, and reads/writes used. We consider 11 graph algorithms, 3 programming models, 2 graph abstractions, and various families of graphs. The analysis reveals surprising differences between the push and pull variants of different algorithms in performance, speed of convergence, and code complexity; the insights are backed up by performance data from hardware counters. We use these findings to show which variant is faster for each algorithm and to develop generic strategies that enable even higher speedups. Our insights can be used to accelerate graph processing engines or libraries on both massively parallel shared-memory machines and distributed-memory systems.
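To make the push-pull dichotomy concrete, the following is a minimal sequential sketch of one PageRank iteration written both ways. It is an illustration only and not the implementation evaluated in the paper: the adjacency-dictionary layout, the function names, and the damping value of 0.85 are assumptions, and vertices without outgoing edges are simply skipped.

```python
from typing import Dict, List

# Hypothetical graph layout: vertex id -> list of out-neighbors.
Graph = Dict[int, List[int]]


def transpose(graph: Graph) -> Graph:
    """Build the in-neighbor lists that the pull variant reads."""
    incoming: Graph = {v: [] for v in graph}
    for u, outs in graph.items():
        for v in outs:
            incoming[v].append(u)
    return incoming


def pagerank_step_push(graph: Graph, rank: Dict[int, float],
                       damping: float = 0.85) -> Dict[int, float]:
    """Push: each vertex u writes its contribution into its neighbors' state."""
    n = len(graph)
    new_rank = {v: (1.0 - damping) / n for v in graph}
    for u, outs in graph.items():
        if not outs:
            continue  # assumption: dangling vertices contribute nothing
        share = damping * rank[u] / len(outs)
        for v in outs:
            # Write to state owned by another vertex. In a parallel run,
            # this is the update that would need an atomic or a lock.
            new_rank[v] += share
    return new_rank


def pagerank_step_pull(graph: Graph, incoming: Graph, rank: Dict[int, float],
                       damping: float = 0.85) -> Dict[int, float]:
    """Pull: each vertex v reads its in-neighbors and writes only its own state."""
    n = len(graph)
    new_rank: Dict[int, float] = {}
    for v in graph:
        total = 0.0
        for u in incoming[v]:
            # Read-only access to other vertices' previous ranks.
            total += rank[u] / len(graph[u])
        # Single write to the private entry of v; no conflict with other vertices.
        new_rank[v] = (1.0 - damping) / n + damping * total
    return new_rank


if __name__ == "__main__":
    g = {0: [1, 2], 1: [2], 2: [0]}
    r0 = {v: 1.0 / len(g) for v in g}
    print(pagerank_step_push(g, r0))
    print(pagerank_step_pull(g, transpose(g), r0))
```

Both functions compute the same ranks up to floating-point rounding. They differ in who writes where: the push loop updates entries belonging to other vertices, whereas the pull loop only reads remote entries and writes its own, which is the source of the differences in synchronization cost that the abstract refers to.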
