Efficient Sequential Consistency in GPUs via Relativistic Cache Coherence

Recent work has argued that sequential consistency (SC) in GPUs can perform on par with weak memory models, provided ordering stalls are made less frequent by relaxing ordering for private and read-only data. In this paper, we address the complementary problem of reducing stall latencies for both read-only and read-write data. We find that SC stalls are particularly problematic for workloads with inter-workgroup sharing and occur primarily due to earlier stores in the same thread; a substantial part of the overhead comes from the need to stall until write permissions are obtained (to ensure write atomicity). To address this, we propose Relativistic Cache Coherence (RCC), a GPU coherence protocol that grants write permissions without stalling but can still be used to implement SC. RCC uses logical timestamps to determine a global memory order and L1 read permissions; even though each core may see a different logical "time," SC ordering can still be maintained. Unlike previous GPU SC proposals, our design requires neither invasive core changes nor additional per-core storage to classify read-only/private data. For workloads with inter-workgroup sharing, overall performance is 29% better and energy is 25% lower than in the best previous GPU SC proposals, and within 7% of the best non-SC design.
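To make the idea of non-stalling writes under logical timestamps more concrete, the following is a minimal sketch of a generic lease-based timestamp scheme. It is not the RCC protocol itself: the lease length, the `L2Line` and `Core` classes, and the `message_passing_demo` function are all hypothetical names and simplifications introduced here for illustration. Each reader is assumed to obtain a lease on a line up to some logical time, and a writer, instead of waiting for readers to give up their copies, simply advances its own logical clock past the latest lease.

```python
from dataclasses import dataclass, field

LEASE = 10  # hypothetical lease length, in logical time units


@dataclass
class L2Line:
    value: int = 0
    wts: int = 0  # logical time of the most recent write
    rts: int = 0  # latest logical time covered by any reader's lease


@dataclass
class Core:
    l2: dict                                # shared map: address -> L2Line
    now: int = 0                            # this core's own logical time
    l1: dict = field(default_factory=dict)  # address -> (value, lease expiry)

    def load(self, addr):
        cached = self.l1.get(addr)
        if cached is not None and self.now <= cached[1]:
            return cached[0]                # lease still valid at our logical time
        line = self.l2[addr]
        self.now = max(self.now, line.wts)  # the read is ordered after the last write
        line.rts = max(line.rts, self.now + LEASE)
        self.l1[addr] = (line.value, line.rts)
        return line.value

    def store(self, addr, value):
        line = self.l2[addr]
        # No stall: rather than waiting for outstanding leases to be revoked,
        # the writer moves its logical clock past all of them.
        self.now = max(self.now, line.rts + 1)
        line.value, line.wts, line.rts = value, self.now, self.now
        self.l1[addr] = (value, self.now)


def message_passing_demo():
    l2 = {"data": L2Line(), "flag": L2Line()}
    producer, consumer = Core(l2), Core(l2)
    consumer.load("data")        # consumer caches the old data under a lease
    producer.store("data", 1)    # proceeds immediately, at a later logical time
    producer.store("flag", 1)
    seen_flag = consumer.load("flag")  # pulls the consumer's clock forward
    seen_data = consumer.load("data")  # stale lease has expired, so this refetches
    return seen_flag, seen_data
```

In this sketch the two cores run at different logical times, yet a consumer that observes the new `flag` also observes the new `data`, because reading `flag` advances its clock beyond the lease on its stale copy of `data`. Under these assumptions, every access carries a logical time from which a single global order can be derived, which is the property an SC implementation needs.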
