XBFS: eXploring Runtime Optimizations for Breadth-First Search on GPUs

Attracted by the enormous potentials of Graphics Processing Units (GPUs), an array of efforts has surged to deploy Breadth-First Search (BFS) on GPUs, which, however, often exploits the static mechanisms to address the challenges that are dynamic in nature. Such a mismatch prevents us from achieving the optimal performance for offloading graph traversal on GPUs. To this end, we propose XBFS that leverages the runtime optimizations atop GPUs to cope with the nondeterministic characteristics of BFS with the following three techniques: First, XBFS adaptively exploits four either new or optimized frontier queue generation designs to accommodate various BFS levels that present dissimilar features. Second, inspired by the observation that the workload associated with each vertex is not proportional to its degree in bottom-up, we design three new strategies to better balance the workload. Third, XBFS introduces the first truly asynchronous bottom-up traversal which allows BFS to visit vertices for multiple levels at a single iteration with both theoretical soundness and practical benefits. Taken together, XBFS is, on average, 3.5×, 4.9×, 11.2× and 6.1× faster than the state-of-the-art Enterprise, Tigr, Gunrock on a Quadro P6000 GPU and Ligra on a 24-core Intel Xeon Platinum 8175M CPU. Note, the CPU used for Ligra is more expensive than the GPU for XBFS.

[1]  Guy E. Blelloch,et al.  Parallelism in Randomized Incremental Algorithms , 2018, J. ACM.

[2]  H. Howie Huang,et al.  iBFS: Concurrent Breadth-First Search on GPUs , 2016, SIGMOD Conference.

[3]  David A. Bader,et al.  Scalable and High Performance Betweenness Centrality on the GPU , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[4]  David A. Patterson,et al.  Direction-optimizing Breadth-First Search , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[5]  Laxmi N. Bhuyan,et al.  Scalable SIMD-Efficient Graph Processing on GPUs , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[6]  Keval Vora,et al.  CuSha: vertex-centric graph processing on GPUs , 2014, HPDC '14.

[7]  Hang Liu,et al.  SIMD-X: Programming and Processing of Graph Algorithms on GPUs , 2018, USENIX Annual Technical Conference.

[8]  H. Howie Huang,et al.  T ri C ore : parallel triangle counting on GPUs , 2018 .

[9]  Michela Becchi,et al.  Deploying Graph Algorithms on GPUs: An Adaptive Solution , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[10]  H. Howie Huang,et al.  High-Performance Triangle Counting on GPUs , 2018, 2018 IEEE High Performance extreme Computing Conference (HPEC).

[11]  Martin D. F. Wong,et al.  An effective GPU implementation of breadth-first search , 2010, Design Automation Conference.

[12]  H. Howie Huang,et al.  iSpan: Parallel Identification of Strongly Connected Components with Spanning Trees , 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.

[13]  Kamesh Madduri,et al.  Parallel breadth-first search on distributed memory systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[14]  Mark J. Harris,et al.  Parallel Prefix Sum (Scan) with CUDA , 2011 .

[15]  H. Howie Huang,et al.  CECI: Compact Embedding Cluster Index for Scalable Subgraph Matching , 2019, SIGMOD Conference.

[16]  Andrew S. Grimshaw,et al.  Scalable GPU graph traversal , 2012, PPoPP '12.

[17]  Wenguang Chen,et al.  Gemini: A Computation-Centric Distributed Graph Processing System , 2016, OSDI.

[18]  H. Howie Huang,et al.  TriCore: Parallel Triangle Counting on GPUs , 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.

[19]  Bo Wu,et al.  Graphie: Large-Scale Asynchronous Graph Traversals on Just a GPU , 2017, 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[20]  H. Howie Huang,et al.  Enterprise: breadth-first graph traversal on GPUs , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[21]  Zhijia Zhao,et al.  Tigr: Transforming Irregular Graphs for GPU-Friendly Graph Processing , 2018, ASPLOS.

[22]  Dimitrios S. Nikolopoulos,et al.  GraphGrind: addressing load imbalance of graph partitioning , 2017, ICS.

[23]  H. Howie Huang,et al.  Graphene: Fine-Grained IO Management for Graph Computing , 2017, FAST.

[24]  Ed Anderson,et al.  LAPACK Users' Guide , 1995 .

[25]  Alexander Tiskin,et al.  All-Pairs Shortest Paths Computation in the BSP Model , 2001, ICALP.

[26]  Peter Schwabe Graphics Processing Units , 2014, Secure Smart Embedded Devices, Platforms and Applications.

[27]  Kunle Olukotun,et al.  Efficient Parallel Graph Exploration on Multi-Core CPU and GPU , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[28]  Sabu M. Thampi,et al.  Survey of Search and Replication Schemes in Unstructured P2p Networks , 2010, Netw. Protoc. Algorithms.

[29]  Kunle Olukotun,et al.  Accelerating CUDA graph algorithms at maximum warp , 2011, PPoPP '11.

[30]  John D. Owens,et al.  Gunrock: a high-performance graph processing library on the GPU , 2015, PPoPP.

[31]  Guy E. Blelloch,et al.  Ligra: a lightweight graph processing framework for shared memory , 2013, PPoPP '13.

[32]  Keshav Pingali,et al.  Groute: An Asynchronous Multi-GPU Programming Model for Irregular Computations , 2017, PPoPP.

[33]  Nancy M. Amato,et al.  Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.