Gunrock: a high-performance graph processing library on the GPU

For large-scale graph analytics on the GPU, two significant challenges stand in the way of a programmable, high-performance graph library: the irregularity of data access and control flow, and the complexity of programming GPUs. Gunrock, our graph-processing system, uses a high-level, bulk-synchronous abstraction with traversal and computation steps, designed specifically for the GPU. Gunrock couples high performance with a high-level programming model that allows programmers to quickly develop new graph primitives in fewer than 300 lines of code. We evaluate Gunrock on five graph primitives and show that it achieves at least an order of magnitude speedup over Boost and PowerGraph, performance comparable to the fastest hardwired GPU primitives, and better performance than any other high-level GPU graph library.
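The abstract describes the programming model only at a high level: each bulk-synchronous iteration works on a frontier of vertices through traversal and computation steps. The Python sketch below illustrates that idea for breadth-first search on the CPU. It is not Gunrock's API (Gunrock itself is a CUDA C++ library). The CSR layout and the names `advance`, `filter_frontier`, and `bfs` are hypothetical choices for illustration, and loops that would run in parallel on a GPU run sequentially here.

```python
from typing import Callable, List


def advance(row_offsets: List[int], col_indices: List[int],
            frontier: List[int], cond: Callable[[int, int], bool]) -> List[int]:
    """Traversal step (hypothetical): expand each frontier vertex to its
    neighbors in a CSR graph, keeping destinations for which cond(src, dst)
    is True. Duplicates are allowed, as when many GPU threads race."""
    out = []
    for src in frontier:
        for e in range(row_offsets[src], row_offsets[src + 1]):
            dst = col_indices[e]
            if cond(src, dst):
                out.append(dst)
    return out


def filter_frontier(frontier: List[int],
                    functor: Callable[[int], bool]) -> List[int]:
    """Filter step (hypothetical): keep vertices for which functor(v) is True.
    The functor may also perform the per-vertex computation step."""
    return [v for v in frontier if functor(v)]


def bfs(row_offsets: List[int], col_indices: List[int], source: int) -> List[int]:
    """Return the BFS depth of every vertex (-1 if unreachable)."""
    n = len(row_offsets) - 1
    depth = [-1] * n
    depth[source] = 0
    frontier = [source]
    level = 0
    while frontier:  # one bulk-synchronous iteration per pass
        level += 1
        # Traversal: collect neighbors that still look unvisited (no writes).
        frontier = advance(row_offsets, col_indices, frontier,
                           lambda src, dst: depth[dst] == -1)

        # Computation + filtering: label each vertex once; later duplicates
        # of the same vertex see depth != -1 and are dropped.
        def label(v: int) -> bool:
            if depth[v] == -1:
                depth[v] = level
                return True
            return False

        frontier = filter_frontier(frontier, label)
    return depth
```

For the undirected graph with edges 0-1, 0-2, 1-3, 2-3, and 3-4 stored in CSR form, `bfs([0, 2, 4, 6, 9, 10], [1, 2, 0, 3, 0, 3, 1, 2, 4, 3], 0)` returns `[0, 1, 1, 2, 3]`. Letting the traversal step emit vertex 3 twice and having the filter step both label and deduplicate it mimics what happens on a GPU, where several threads can observe the same unvisited vertex in one iteration; in this sequential sketch the first occurrence wins.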