A sound and complete abstraction for reasoning about parallel prefix sums

Prefix sums are key building blocks in the implementation of many concurrent software applications, and recently much work has gone into efficiently implementing prefix sums to run on massively parallel graphics processing units (GPUs). Because they lie at the heart of many GPU-accelerated applications, the correctness of prefix sum implementations is of prime importance. We introduce a novel abstraction, the interval of summations, that allows scalable reasoning about implementations of prefix sums. We present this abstraction as a monoid, and prove a soundness and completeness result showing that a generic sequential prefix sum implementation is correct for an array of length $n$ if and only if it computes the correct result for a specific test case when instantiated with the interval of summations monoid. This allows correctness to be established by running a single test where the input and result require $O(n \lg(n))$ space. This improves upon an existing result by Sheeran where the input requires $O(n \lg(n))$ space and the result $O(n^2 \lg(n))$ space, and is more feasible for large $n$ than a method by Voigtländer that uses $O(n)$ space for the input and result but requires running $O(n^2)$ tests. We then extend our abstraction and results to the context of data-parallel programs, developing an automated verification method for GPU implementations of prefix sums. Our method uses static verification to prove that a generic prefix sum implementation is data race-free, after which functional correctness of the implementation can be determined by running a single test case under the interval of summations abstraction. We present an experimental evaluation using four different prefix sum algorithms, showing that our method is highly automatic, scales to large thread counts, and significantly outperforms Voigtländer's method when applied to large arrays.
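The single-test idea can be illustrated with a minimal Python sketch. The abstract does not spell out the monoid or the test case, so everything here is an assumption made for illustration: an element is encoded as a pair `(i, j)` standing for the contiguous summation of inputs `i` through `j`, `IDENTITY` is the neutral element, and `TOP` is an absorbing element standing for anything that is not a contiguous summation. The assumed test feeds in the singleton intervals `(i, i)` and expects `(0, i)` at every position of an inclusive prefix sum. The names `combine`, `passes_interval_test`, and the three scan functions are hypothetical.

```python
from itertools import accumulate
import operator

# Hypothetical encoding of the monoid (an assumption, not taken from the paper):
IDENTITY = "identity"  # neutral element
TOP = "top"            # absorbing element: not a contiguous summation


def combine(a, b):
    """Combine two abstract elements; only adjacent intervals merge."""
    if a == IDENTITY:
        return b
    if b == IDENTITY:
        return a
    if a == TOP or b == TOP:
        return TOP
    (i, j), (k, l) = a, b
    # (i, j) followed by (k, l) is a contiguous summation only if k == j + 1.
    return (i, l) if j + 1 == k else TOP


def sequential_scan(xs, op):
    """Generic inclusive prefix sum, left to right."""
    return list(accumulate(xs, op))


def kogge_stone_scan(xs, op):
    """Generic inclusive prefix sum in the Kogge-Stone style (run sequentially)."""
    out, d = list(xs), 1
    while d < len(out):
        out = [out[i] if i < d else op(out[i - d], out[i]) for i in range(len(out))]
        d *= 2
    return out


def buggy_scan(xs, op):
    """Same as above but with the operands swapped: wrong for non-commutative op."""
    out, d = list(xs), 1
    while d < len(out):
        out = [out[i] if i < d else op(out[i], out[i - d]) for i in range(len(out))]
        d *= 2
    return out


def passes_interval_test(scan, n):
    """Run the assumed single test: singleton intervals in, (0, i) expected out."""
    inputs = [(i, i) for i in range(n)]
    expected = [(0, i) for i in range(n)]
    return scan(inputs, combine) == expected
```

In this sketch, `buggy_scan` still produces the right answer on integers with `operator.add`, because integer addition is commutative, yet it fails the interval test because its first step already combines non-adjacent intervals into `TOP`. Each pair stores two indices below $n$, which is in line with the $O(n \lg(n))$ space bound quoted above.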

[1] Ralf Hinze. An Algebra of Scans, 2004, MPC.

[2] Christina Freytag, et al. Using MPI Portable Parallel Programming With The Message Passing Interface, 2016.

[3] Peng Li, et al. GKLEE: concolic verification and test generation for GPUs, 2012, PPoPP '12.

[4] Sorin Lerner, et al. Verifying GPU kernels by test amplification, 2012, PLDI.

[5] Guy E. Blelloch, et al. Scans as Primitive Parallel Operations, 1989, ICPP.

[6] Harold S. Stone, et al. A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations, 1973, IEEE Transactions on Computers.

[7] H. T. Kung, et al. A Regular Layout for Parallel Adders, 1982, IEEE Transactions on Computers.

[8] Kevin Skadron, et al. Rodinia: A benchmark suite for heterogeneous computing, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[9] Mary Sheeran, et al. Functional and dynamic programming in the design of parallel prefix networks, 2010, Journal of Functional Programming.

[10] Donald E. Knuth, et al. The art of computer programming, volume 3: (2nd ed.) sorting and searching, 1998.

[11] Paul H. J. Kelly, et al. Barrier invariants: a shared state abstraction for the analysis of data-dependent GPU kernels, 2013, OOPSLA.

[12] Guodong Li, et al. Scalable SMT-based verification of GPU kernel functions, 2010, FSE '10.

[13] Harold S. Stone, et al. Parallel Processing with the Perfect Shuffle, 1971, IEEE Transactions on Computers.

[14] Igor Sergeev. On the complexity of parallel prefix circuits, 2013, Electron. Colloquium Comput. Complex.

[15] David Thomas, et al. The Art in Computer Programming, 2001.

[16] Benjamin C. Pierce, et al. Types and programming languages: the next generation, 2003, 18th Annual IEEE Symposium of Logic in Computer Science, 2003. Proceedings.

[17] Jack Sklansky, et al. Conditional-Sum Addition Logic, 1960, IRE Trans. Electron. Comput.

[18] Dawson R. Engler, et al. KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs, 2008, OSDI.

[19] Paul H. J. Kelly, et al. Symbolic Testing of OpenCL Code, 2011, Haifa Verification Conference.

[20] Collin McCurdy, et al. The Scalable Heterogeneous Computing (SHOC) benchmark suite, 2010, GPGPU-3.

[21] Yao Zhang, et al. Scan primitives for GPU computing, 2007, GH '07.

[22] Sanjeev Saxena, et al. On Parallel Prefix Computation, 1994, Parallel Process. Lett.

[23] Donald Ervin Knuth, et al. The Art of Computer Programming, 2nd Ed. (Addison-Wesley Series in Computer Science and Information, 1978.

[24] Mark J. Harris, et al. Parallel Prefix Sum (Scan) with CUDA, 2011.

[25] Wen-mei W. Hwu, et al. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing, 2012.

[26] Ulf Assarsson, et al. Efficient stream compaction on wide SIMD many-core architectures, 2009, High Performance Graphics.

[27] Jesper Larsson Träff, et al. Parallel Prefix (Scan) Algorithms for MPI, 2006, PVM/MPI.

[28] E. Gallopoulos, et al. A parallel method for fast and practical high-order newton interpolation, 1990.

[29] Janis Voigtländer. Much ado about two (pearl): a pearl on parallel prefix computation, 2008, POPL '08.

[30] Patrick Cousot, et al. Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints, 1977, POPL.

[31] Alexander Knapp, et al. On the Correctness of the SIMT Execution Model of GPUs, 2012, ESOP.

[32] William Gropp, et al. Skjellum using mpi: portable parallel programming with the message-passing interface, 1994.

[33] Michael Garland, et al. Designing efficient sorting algorithms for manycore GPUs, 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[34] Youfeng Wu, et al. Optimizing Data Parallel Operations on Many-Core Platforms, 2006.

[35] Marieke Huisman, et al. Specification and Verification of GPGPU Programs using Permission-Based Separation Logic, 2013.

[36] Adam Betts, et al. GPUVerify: a verifier for GPU kernels, 2012, OOPSLA '12.

[37] Guy E. Blelloch, et al. Prefix sums and their applications, 1990.

[38] Alastair F. Donaldson, et al. Interleaving and Lock-Step Semantics for Analysis and Verification of GPU Kernels, 2013, ESOP.

[39] Andrew S. Grimshaw, et al. Parallel Scan for Stream Architectures, 2012.