Morph algorithms on GPUs

There is growing interest in using GPUs to accelerate graph algorithms such as breadth-first search, PageRank computation, and shortest-path finding. However, these algorithms do not modify the graph structure, so they are relatively easy to implement compared to general graph algorithms such as mesh generation and refinement, which morph the underlying graph in non-trivial ways by adding and removing nodes and edges. Relatively little is known about how to implement such morph algorithms efficiently on GPUs. In this paper, we present and study four morph algorithms: (i) a computational geometry algorithm called Delaunay Mesh Refinement (DMR), (ii) an approximate SAT solver called Survey Propagation (SP), (iii) a compiler analysis called Points-To Analysis (PTA), and (iv) Boruvka's Minimum Spanning Tree algorithm (MST). Each of these algorithms modifies the graph data structure in a different way and thus poses its own challenges, which we overcome using algorithmic and GPU-specific optimizations. We propose efficient techniques for concurrent subgraph addition, subgraph deletion, and conflict detection, along with several optimizations that improve the scalability of morph algorithms. For an input mesh with 10 million triangles, our DMR code achieves an 80x speedup over the highly optimized serial Triangle program and a 2.3x speedup over a multicore implementation running with 48 threads. Our SP code is 3x faster than a multicore implementation with 48 threads on an input with 1 million literals. The PTA implementation analyzes six SPEC 2000 benchmark programs in just 74 milliseconds, achieving a geometric mean speedup of 9.3x over a 48-thread multicore version. Our MST code is slower than a multicore version with 48 threads for sparse graphs but significantly faster for denser graphs. This work provides several insights into how other morph algorithms can be implemented efficiently on GPUs.
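To make the notion of a morph algorithm concrete, the following is a minimal serial sketch of Boruvka's MST algorithm in Python. It is not the GPU implementation described in the abstract. The function name `boruvka_mst`, the edge-list input format, and the union-find representation of merged components are assumptions made for illustration only. Each round selects the cheapest outgoing edge of every component and then merges components along those edges, which is the step that changes the graph's structure.

```python
def boruvka_mst(num_nodes, edges):
    """Serial Boruvka sketch (illustrative; not the paper's GPU code).

    edges: list of (u, v, weight) tuples over nodes 0..num_nodes-1.
    Returns (total_weight, chosen_edges); for a disconnected graph the
    result is a minimum spanning forest.
    """
    parent = list(range(num_nodes))  # union-find: one component per node

    def find(x):
        # Find the component representative, halving paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    total = 0
    components = num_nodes
    while components > 1:
        # Step 1: for each component, pick its cheapest outgoing edge.
        # Ties are broken by edge index so the choice is consistent.
        cheapest = {}
        for idx, (u, v, w) in enumerate(edges):
            ru, rv = find(u), find(v)
            if ru == rv:
                continue  # edge is internal to a component
            for r in (ru, rv):
                best = cheapest.get(r)
                if best is None or (w, idx) < (edges[best][2], best):
                    cheapest[r] = idx
        if not cheapest:
            break  # no edges leave any component: graph is disconnected

        # Step 2: merge components along the selected edges. This is the
        # "morph" step: the set of components shrinks every round.
        for idx in sorted(set(cheapest.values())):
            u, v, w = edges[idx]
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                chosen.append((u, v, w))
                total += w
                components -= 1
    return total, chosen


if __name__ == "__main__":
    example = [(0, 1, 1), (1, 2, 2), (2, 3, 3), (0, 3, 4), (0, 2, 5)]
    print(boruvka_mst(4, example))
```

In this sketch both steps run sequentially. In a parallel setting, the cheapest-edge selection and the merges would be performed concurrently, which is where the conflict detection mentioned above becomes necessary.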
