Gunrock

For large-scale graph analytics on the GPU, two factors have made it difficult to develop a programmable, high-performance graph library: the irregularity of data access and control flow, and the complexity of programming GPUs. “Gunrock,” our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock balances performance and expressiveness by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model that lets programmers quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock’s overall performance on different GPU architectures across a wide range of graph primitives, from traversal-based and ranking algorithms to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock achieves on average at least an order of magnitude speedup over Boost and PowerGraph, performance comparable to the fastest GPU hardwired primitives and to CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library.
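
To make the frontier-centric, bulk-synchronous idea concrete, the following is a minimal sequential sketch of breadth-first search written as repeated operations on a vertex frontier. It is not Gunrock's API or implementation: the function names `advance`, `filter_frontier`, and `bfs`, and the compressed-sparse-row inputs `row_offsets` and `col_indices`, are assumptions chosen for illustration, and the plain Python loops stand in for what would be data-parallel GPU kernels.

```python
from typing import Callable, List, Tuple


def advance(row_offsets: List[int], col_indices: List[int],
            frontier: List[int]) -> List[Tuple[int, int]]:
    """Expand a vertex frontier into the (src, dst) pairs of its outgoing edges."""
    edges = []
    for u in frontier:
        # Neighbors of u occupy col_indices[row_offsets[u]:row_offsets[u + 1]].
        for e in range(row_offsets[u], row_offsets[u + 1]):
            edges.append((u, col_indices[e]))
    return edges


def filter_frontier(candidates: List[Tuple[int, int]],
                    keep: Callable[[int, int], bool]) -> List[int]:
    """Keep the destination of every candidate edge that passes the predicate."""
    return [dst for src, dst in candidates if keep(src, dst)]


def bfs(row_offsets: List[int], col_indices: List[int], source: int) -> List[int]:
    """Return the BFS depth of every vertex, or -1 if it is unreachable."""
    num_vertices = len(row_offsets) - 1
    depth = [-1] * num_vertices
    depth[source] = 0
    frontier = [source]
    level = 0
    # Each loop iteration is one bulk-synchronous step: the whole frontier
    # is processed before the next frontier becomes visible.
    while frontier:
        level += 1

        def visit(src: int, dst: int, level: int = level) -> bool:
            # Label a vertex the first time it is reached and keep it.
            if depth[dst] == -1:
                depth[dst] = level
                return True
            return False

        frontier = filter_frontier(
            advance(row_offsets, col_indices, frontier), visit)
    return depth


# Hypothetical example graph: 0->1, 0->2, 1->3, 2->3; vertex 4 is isolated.
example_row_offsets = [0, 2, 3, 4, 4, 4]
example_col_indices = [1, 2, 3, 3]
example_depths = bfs(example_row_offsets, example_col_indices, 0)
```

In this sketch the algorithm-specific logic lives entirely in the small `visit` predicate, while the frontier expansion and contraction steps are generic. That separation is the kind of division of labor a data-centric abstraction aims for, although a real GPU implementation would also have to handle load balancing and concurrent updates, which this sequential version sidesteps.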
