Compilation Techniques for Graph Algorithms on GPUs

The performance of graph programs depends heavily on the algorithm, the size and structure of the input graphs, and the features of the underlying hardware. No single set of optimizations or single hardware platform works well across all settings. To achieve high performance, the programmer must carefully select both the set of optimizations and the hardware platform to use. However, when switching between CPUs and GPUs, the programmer currently must re-implement the entire algorithm with different optimizations to achieve high performance. We propose the GG GPU compiler, a new graph processing framework that achieves high performance on both CPUs and GPUs using the same algorithm specification. GG significantly expands the optimization space of GPU graph processing frameworks with a novel GPU scheduling language and compiler that enable combining load balancing, edge traversal direction, active vertex set creation, active vertex set processing ordering, and kernel fusion optimizations. GG also introduces two performance optimizations, Edge-based Thread Warps CTAs load balancing (ETWC) and EdgeBlocking, to expand the optimization space for GPUs. ETWC improves load balance by dynamically partitioning the edges of each vertex into blocks that are assigned to threads, warps, and CTAs for execution. EdgeBlocking improves the locality of the program by reordering the edges and restricting random memory accesses to fit within the L2 cache. We evaluate GG on 5 algorithms and 9 input graphs on both Pascal and Volta generation NVIDIA GPUs, and show that it achieves up to 5.11x speedup over state-of-the-art GPU graph processing frameworks and is the fastest on 62 out of the 90 experiments.
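
The two optimizations described above can be illustrated with a minimal host-side sketch in Python. This is not GG's implementation: the function names `etwc_partition` and `edge_blocking`, the CTA size of 256 threads, the static (rather than dynamic, on-GPU) partitioning, and the choice to group edges by destination vertex range are all assumptions made for illustration. The sketch only shows the idea of splitting each vertex's edges into CTA-, warp-, and thread-sized blocks, and of reordering edges so that random accesses within one group stay inside a bounded vertex range.

```python
WARP_SIZE = 32   # threads per warp on NVIDIA GPUs
CTA_SIZE = 256   # assumed threads per CTA (hypothetical choice)


def etwc_partition(degrees, warp_size=WARP_SIZE, cta_size=CTA_SIZE):
    """Split each vertex's edge list into CTA-, warp-, and thread-sized blocks.

    Returns three lists of (vertex, first_edge_offset, edge_count) work items.
    """
    cta_work, warp_work, thread_work = [], [], []
    for v, deg in enumerate(degrees):
        offset = 0
        # Peel off as many full CTA-sized blocks as possible.
        while deg - offset >= cta_size:
            cta_work.append((v, offset, cta_size))
            offset += cta_size
        # Then peel off full warp-sized blocks.
        while deg - offset >= warp_size:
            warp_work.append((v, offset, warp_size))
            offset += warp_size
        # The remainder (fewer than warp_size edges) goes to a single thread.
        if deg - offset > 0:
            thread_work.append((v, offset, deg - offset))
    return cta_work, warp_work, thread_work


def edge_blocking(edges, block_vertices):
    """Reorder (src, dst) edges so edges whose destination falls in the same
    range of `block_vertices` vertex ids are contiguous.

    Grouping by destination is an assumption of this sketch; `block_vertices`
    stands in for however many vertices' data fit in the L2 cache.
    """
    # sorted() is stable, so the original order is kept within each block.
    return sorted(edges, key=lambda e: e[1] // block_vertices)


if __name__ == "__main__":
    cta, warp, thread = etwc_partition([300, 40, 5, 0])
    print(cta, warp, thread)
    print(edge_blocking([(0, 5), (1, 0), (2, 4), (3, 1)], block_vertices=4))
```

With this split, a vertex with hundreds of edges no longer occupies one thread for its whole edge list, which is the load imbalance the abstract says ETWC targets.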
