Auto-parallelization of data structure operations for GPUs

We present an auto-parallelization technique for generating GPU implementations of data-structure operations from a sequential specification. The technique partitions the data-structure operations into barrier-separated phases such that each phase executes only homogeneous operations. Homogeneity is determined by the method type, which is derived from the specification. The technique has two key properties: (i) it ensures linearizability of the data structure, and (ii) it can compose multiple data-structure operations with a guarantee of optimal barrier placement, which we formally prove. We illustrate the usefulness of the technique by synthesizing efficient GPU implementations of practical graph algorithms, such as single-source shortest paths, which uses a concurrent worklist, and Delaunay mesh refinement, which uses a worklist and a mesh, as well as a doubly linked list supporting arbitrary insertion and deletion.
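To make the idea of barrier-separated homogeneous phases concrete, the following is a minimal sketch and not the technique described above. It assumes a hypothetical encoding in which each operation is a `(method_type, argument)` pair, and it greedily groups consecutive operations of the same method type into one phase. The names `partition_into_phases` and `barrier_count` are illustrative only, and this greedy grouping makes no claim to optimal barrier placement.

```python
from itertools import groupby

# Hypothetical encoding (an assumption, not from the paper):
# an operation is a (method_type, argument) pair.


def partition_into_phases(ops):
    """Group consecutive operations with the same method type into phases.

    Each phase holds only one method type, so it is homogeneous in the
    sense used above. This is a greedy illustration, not the paper's
    partitioning algorithm.
    """
    return [list(run) for _, run in groupby(ops, key=lambda op: op[0])]


def barrier_count(phases):
    """One barrier separates each pair of adjacent phases."""
    return max(len(phases) - 1, 0)


# A sequential trace of list operations (illustrative data).
ops = [("insert", 5), ("insert", 7), ("delete", 5), ("insert", 9)]
phases = partition_into_phases(ops)

for index, phase in enumerate(phases):
    print(f"phase {index}: {phase}")
print("barriers:", barrier_count(phases))
```

On a GPU, each phase in such a scheme would correspond to work that threads execute concurrently, with a global barrier between phases; the sketch only models the grouping step on the host side.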
