Dynamic Load Balancing Strategies for Graph Applications on GPUs

Acceleration of graph applications on GPUs has found large interest due to the ubiquitous use of graph processing in various domains. The inherent \textit{irregularity} in graph applications leads to several challenges for parallelization. A key challenge, which we address in this paper, is that of load-imbalance. If the work-assignment to threads uses node-based graph partitioning, it can result in skewed task-distribution, leading to poor load-balance. In contrast, if the work-assignment uses edge-based graph partitioning, the load-balancing is better, but the memory requirement is relatively higher. This makes it unsuitable for large graphs. In this work, we propose three techniques for improved load-balancing of graph applications on GPUs. Each technique brings in unique advantages, and a user may have to employ a specific technique based on the requirement. Using Breadth First Search and Single Source Shortest Paths as our processing kernels, we illustrate the effectiveness of each of the proposed techniques in comparison to the existing node-based and edge-based mechanisms.

[1]  Fabio Checconi,et al.  Scalable Single Source Shortest Path Algorithms for Massively Parallel Systems , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[2]  KarrasPanagiotis,et al.  Scalable parallel minimum spanning forest computation , 2012 .

[3]  David A. Bader,et al.  Scalable and High Performance Betweenness Centrality on the GPU , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[4]  Keshav Pingali,et al.  Data-Driven Versus Topology-driven Irregular Computations on GPUs , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[5]  P. J. Narayanan,et al.  Accelerating Large Graph Algorithms on the GPU Using CUDA , 2007, HiPC.

[6]  Andrew S. Grimshaw,et al.  Scalable GPU graph traversal , 2012, PPoPP '12.

[7]  David A. Bader,et al.  Scalable Graph Exploration on Multicore Processors , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[8]  Edmond Chow,et al.  A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[9]  David A. Bader,et al.  National Laboratory Lawrence Berkeley National Laboratory Title A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets Permalink , 2009 .

[10]  Kunle Olukotun,et al.  Efficient Parallel Graph Exploration on Multi-Core CPU and GPU , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[11]  Martin D. F. Wong,et al.  An effective GPU implementation of breadth-first search , 2010, Design Automation Conference.

[12]  Matei Ripeanu,et al.  On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[13]  Keshav Pingali,et al.  Optimistic parallelism requires abstractions , 2007, PLDI '07.

[14]  Ulrich Meyer,et al.  [Delta]-stepping: a parallelizable shortest path algorithm , 2003, J. Algorithms.

[15]  Thanh-Tung Cao,et al.  Scalable parallel minimum spanning forest computation , 2012, PPoPP '12.

[16]  Keshav Pingali,et al.  Lonestar: A suite of parallel irregular programs , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[17]  Matei Ripeanu,et al.  A yoke of oxen and a thousand chickens for heavy lifting graph processing , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[18]  P J Narayanan,et al.  Fast minimum spanning tree for large graphs on the GPU , 2009, High Performance Graphics.

[19]  Kamesh Madduri,et al.  Parallel breadth-first search on distributed memory systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[20]  David A. Bader,et al.  Ieee Transactions on Parallel and Distributed Systems Graphct: Multithreaded Algorithms for Massive Graph Analysis , 2022 .

[21]  Ümit V. Çatalyürek,et al.  Betweenness centrality on GPUs and heterogeneous architectures , 2013, GPGPU@ASPLOS.

[22]  Keshav Pingali,et al.  Morph algorithms on GPUs , 2013, PPoPP '13.

[23]  David A. Bader,et al.  Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2 , 2006, 2006 International Conference on Parallel Processing (ICPP'06).