Remote-scope promotion: clarified, rectified, and verified

Modern accelerator programming frameworks, such as OpenCL, organise threads into work-groups. Remote-scope promotion (RSP) is a language extension recently proposed by AMD researchers that is designed to enable applications, for the first time, both to optimise for the common case of intra-work-group communication (using memory scopes to provide consistency only within a work-group) and to allow occasional inter-work-group communication (as required, for instance, to support the popular load-balancing idiom of work stealing). We present the first formal, axiomatic memory model of OpenCL extended with RSP. We have extended the Herd memory model simulator with support for OpenCL kernels that exploit RSP, and used it to discover bugs in several litmus tests and a work-stealing queue that have been used previously in the study of RSP. We have also formalised the proposed GPU implementation of RSP. The formalisation process allowed us to identify bugs in the description of RSP that could result in well-synchronised programs experiencing memory inconsistencies. We present and prove sound a new implementation of RSP that incorporates bug fixes and requires less non-standard hardware than the original implementation. This work, a collaboration between academia and industry, clearly demonstrates how, when designing hardware support for a new concurrent language feature, the early application of formal tools and techniques can help to prevent errors, such as those we have found, from making it into silicon.
