Optimizing ordered graph algorithms with GraphIt

Many graph problems can be solved using ordered parallel graph algorithms that achieve significant speedups over their unordered counterparts by reducing redundant work. This paper introduces a new priority-based extension to GraphIt, a domain-specific language for writing graph applications, that simplifies writing high-performance parallel ordered graph algorithms. The extension enables vertices to be processed in a dynamic order while hiding low-level implementation details from the user. We extend the compiler with new program analyses, transformations, and code generation to produce fast implementations of ordered parallel graph algorithms. We also introduce bucket fusion, a new performance optimization that fuses multiple rounds of ordered algorithms to reduce synchronization overhead, resulting in a 1.2×–3× speedup over the fastest existing ordered algorithm implementations on road networks with large diameters. With the extension, GraphIt achieves up to 3× speedup on six ordered graph algorithms over state-of-the-art frameworks and hand-optimized implementations (Julienne, Galois, and GAPBS) that support ordered algorithms.
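
To make the idea of bucketed, ordered processing and of fusing rounds concrete, the following is a minimal sequential sketch in Python. It is not GraphIt code and not the paper's implementation. It assumes a Δ-stepping-style shortest-path computation with non-negative integer weights, and it models each global synchronization as one counted "round". The function name `delta_stepping`, the `fuse` flag, and the round counter are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
import math


def delta_stepping(graph, source, delta, fuse=False):
    """Bucketed shortest paths (illustrative, sequential).

    graph: dict mapping vertex -> list of (neighbor, weight), weights >= 0.
    Returns (dist, rounds), where rounds counts simulated synchronizations.
    """
    dist = {v: math.inf for v in graph}
    dist[source] = 0
    buckets = defaultdict(set)      # bucket index -> vertices to process
    buckets[0].add(source)
    rounds = 0

    while buckets:
        b = min(buckets)            # process buckets in priority order
        frontier = buckets.pop(b)
        rounds += 1                 # one simulated global synchronization
        while frontier:
            next_same = set()       # vertices that land back in bucket b
            for u in frontier:
                if dist[u] // delta != b:
                    continue        # stale entry: u was moved to a lower bucket
                for v, w in graph[u]:
                    nd = dist[u] + w
                    if nd < dist[v]:
                        dist[v] = nd
                        nb = nd // delta
                        if nb == b:
                            next_same.add(v)
                        else:
                            buckets[nb].add(v)
            if fuse:
                # Fusion (simplified): keep draining bucket b in this round.
                frontier = next_same
            else:
                # No fusion: same-bucket work waits for a new round.
                if next_same:
                    buckets[b] |= next_same
                frontier = set()
    return dist, rounds
```

On a five-vertex chain with unit weights and `delta = 10`, every vertex falls into bucket 0. Without fusion the sketch needs five rounds, one per vertex; with `fuse=True` it needs one round, and both runs produce the same distances. This mirrors, in a very simplified form, why fusing rounds would matter most on large-diameter graphs such as road networks, where many consecutive rounds each carry little work.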
