The Design and Implementation of a Verification Technique for GPU Kernels

We present a technique for the formal verification of GPU kernels, addressing two classes of correctness properties: data races and barrier divergence. Our approach is founded on a novel formal operational semantics for GPU kernels termed synchronous, delayed visibility (SDV) semantics, which captures the execution of a GPU kernel by multiple groups of threads. The SDV semantics provides operational definitions for barrier divergence and for both inter- and intra-group data races. We build on the semantics to develop a method for reducing the task of verifying a massively parallel GPU kernel to that of verifying a sequential program. This completely avoids the need to reason about thread interleavings, and allows existing techniques for sequential program verification to be leveraged. We describe an efficient encoding of data race detection and propose a method for automatically inferring the loop invariants that are required for verification. We have implemented these techniques as a practical verification tool, GPUVerify, that can be applied directly to OpenCL and CUDA source code. We evaluate GPUVerify with respect to a set of 162 kernels drawn from public and commercial sources. Our evaluation demonstrates that GPUVerify is capable of efficient, automatic verification of a large number of real-world kernels.
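To make the notion of an intra-group data race concrete, the following is a minimal sketch in Python. It is not GPUVerify's encoding and not the SDV semantics: it is a brute-force check over a small, fixed number of threads, under the simplifying assumption that barriers split each thread's accesses into intervals and that two accesses race when they come from different threads, fall in the same interval, touch the same address, and at least one is a write. The kernel bodies and the names `accesses_shift_left`, `accesses_shift_left_fixed`, and `find_race` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def accesses_shift_left(tid, n):
    """Hypothetical kernel body A[tid] = A[tid + 1] with no barrier.

    Returns one barrier interval holding a read and a write.
    """
    return [[("R", tid + 1), ("W", tid)]]


def accesses_shift_left_fixed(tid, n):
    """Same kernel, with a barrier between the read and the write.

    Returns two barrier intervals: reads first, writes second.
    """
    return [[("R", tid + 1)], [("W", tid)]]


def find_race(kernel, n):
    """Return a witness (interval, t1, t2, address), or None if no race.

    Assumption: every thread passes the same number of barriers, so
    interval i of one thread lines up with interval i of another.
    """
    traces = [kernel(tid, n) for tid in range(n)]
    for t1, t2 in combinations(range(n), 2):
        for interval, (acc1, acc2) in enumerate(zip(traces[t1], traces[t2])):
            for kind1, addr1 in acc1:
                for kind2, addr2 in acc2:
                    # Conflict: same address and at least one write.
                    if addr1 == addr2 and "W" in (kind1, kind2):
                        return (interval, t1, t2, addr1)
    return None


if __name__ == "__main__":
    print(find_race(accesses_shift_left, 4))
    print(find_race(accesses_shift_left_fixed, 4))
```

The pairwise enumeration here only scales to toy thread counts and checks one fixed configuration; the approach described in the abstract instead reduces the problem to verifying a sequential program, which avoids enumerating threads or interleavings.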
