Specifying and testing GPU workgroup progress models

As GPU availability has increased and programming support has matured, a wider variety of applications is being ported to these platforms. Many parallel applications contain fine-grained synchronization idioms; as such, their correct execution depends on a degree of relative forward progress between threads (or thread groups). Unfortunately, many GPU programming specifications (e.g. Vulkan and Metal) say almost nothing about relative forward progress guarantees between workgroups. Although prior work has proposed a spectrum of plausible progress models for GPUs, cross-vendor specifications have yet to commit to any model. This work presents a collection of tools and experimental data to aid specification designers when considering forward progress guarantees in programming frameworks. As a foundation, we formalize a small parallel programming language that captures the essence of fine-grained synchronization. We then provide a means of formally specifying a progress model, and develop a termination oracle that decides whether a given program is guaranteed to eventually terminate with respect to a given progress model. Next, we formalize a set of constraints that describe concurrent programs that require forward progress to terminate. This allows us to synthesize a set of 483 progress litmus tests. Using the termination oracle, we can determine the expected status of each litmus test -- i.e. whether it is guaranteed to eventually terminate -- under various progress models. We present a large experimental campaign running the litmus tests across 8 GPUs from 5 different vendors. Our results highlight that GPUs have significantly different termination behaviors under our test suite. Most notably, we find that Apple and ARM GPUs do not support the linear occupancy-bound model, contrary to what prior work had hypothesized.
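To make the idea of a termination oracle concrete, the following is a minimal sketch and not the formalization used in this work. It assumes a hypothetical two-instruction language (`spin` and `set` on a single shared flag) and only two illustrative progress models: an unfair one with no guarantees, and a weakly fair one in which every unfinished thread eventually takes a step. The names `step`, `must_terminate`, and `PROGRAM` are invented for illustration. The oracle explores the reachable state graph and reports non-termination if it finds a cycle that the chosen model permits.

```python
# Hypothetical litmus test: each thread is a list of instructions.
# ("spin", v) loops until flag == v; ("set", v) stores v to the flag.
PROGRAM = [
    [("spin", 1)],  # thread 0 waits for the flag
    [("set", 1)],   # thread 1 sets the flag
]

def step(program, state, tid):
    """State after thread `tid` runs one instruction, or None if it is done."""
    pcs, flag = state
    if pcs[tid] == len(program[tid]):
        return None
    op, val = program[tid][pcs[tid]]
    new_pcs = list(pcs)
    if op == "set":
        flag = val
        new_pcs[tid] += 1
    elif flag == val:  # the spin condition holds, so the thread moves on
        new_pcs[tid] += 1
    return (tuple(new_pcs), flag)

def must_terminate(program, fair):
    """True if every execution allowed by the model reaches termination."""
    n = len(program)
    init = ((0,) * n, 0)
    # Explore the reachable state graph; edges are (source, thread, target).
    states, edges, frontier = {init}, [], [init]
    while frontier:
        s = frontier.pop()
        for tid in range(n):
            t = step(program, s, tid)
            if t is None:
                continue
            edges.append((s, tid, t))
            if t not in states:
                states.add(t)
                frontier.append(t)

    def reachable(src):
        # States reachable from src in one or more steps.
        seen, stack = set(), [src]
        while stack:
            u = stack.pop()
            for a, _, b in edges:
                if a == u and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen

    reach = {s: reachable(s) for s in states}
    for s in states:
        if s not in reach[s]:
            continue  # s lies on no cycle
        # Strongly connected component of s, and the threads stepping in it.
        scc = {t for t in reach[s] if s in reach[t]}
        stepping = {tid for a, tid, b in edges if a in scc and b in scc}
        live = {tid for tid in range(n) if s[0][tid] < len(program[tid])}
        # Unfair model: any cycle is an allowed infinite execution.
        # Fair model: only a cycle in which every unfinished thread steps.
        if not fair or stepping == live:
            return False
    return True

print(must_terminate(PROGRAM, fair=True))
print(must_terminate(PROGRAM, fair=False))
```

In this sketch the spin-wait test is guaranteed to terminate only under the fair model, since the unfair model allows thread 0 to spin forever while thread 1 is never scheduled. The progress models studied for GPUs lie between these two extremes and would need a richer encoding than a single `fair` flag.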
