Chasing Away RAts: Semantics and evaluation for relaxed atomics on heterogeneous systems

An unambiguous and easy-to-understand memory consistency model is crucial for ensuring correct synchronization and guiding the future design of heterogeneous systems. In a widely adopted approach, the memory model guarantees sequential consistency (SC) as long as programmers obey certain rules. The popular data-race-free-0 (DRF0) model exemplifies this SC-centric approach by requiring programmers to avoid data races. Recent industry models, however, have extended such SC-centric models to incorporate relaxed atomics. These extensions can improve performance, but they are difficult to specify formally and to use correctly. This work addresses the impact of relaxed atomics on consistency models for heterogeneous systems in two ways. First, we introduce a new model, Data-Race-Free-Relaxed (DRFrlx), that extends DRF0 to provide SC-centric semantics for the common use cases of relaxed atomics. Second, we evaluate the performance of relaxed atomics in CPU-GPU systems for these use cases. We find mixed results: in most cases, relaxed atomics provide only a small benefit in execution time, but in some cases they help significantly (e.g., up to 51% for DRFrlx over DRF0).
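
To make the notion of a relaxed atomic concrete, the following is a minimal C++ sketch of one frequently discussed pattern: an event counter that many threads increment and that is read only after all of them have finished. The function name `count_events` and its parameters are hypothetical and chosen for illustration; this is not code from the paper, it runs on CPU threads only, and it does not implement DRFrlx itself.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (an assumption, not from the paper): every thread
// bumps a shared counter using relaxed atomics. The increments impose no
// ordering on other memory accesses; they only need to be atomic.
long count_events(int num_threads, int per_thread) {
    assert(num_threads >= 0 && per_thread >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(num_threads));
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    // join() synchronizes with the end of each thread, so the load below
    // observes every increment even though it is itself relaxed.
    for (auto& w : workers) {
        w.join();
    }
    return counter.load(std::memory_order_relaxed);
}
```

The total is still exact because each `fetch_add` is an atomic read-modify-write; what the relaxed ordering gives up is any guarantee about how these increments are ordered relative to other memory accesses.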
