Cache coherence for GPU architectures

While scalable coherence has been extensively studied in the context of general purpose chip multiprocessors (CMPs), GPU architectures present a new set of challenges. Introducing conventional directory protocols adds unnecessary coherence traffic overhead to existing GPU applications. Moreover, these protocols increase the verification complexity of the GPU memory system. Recent research, Library Cache Coherence (LCC) [34, 54], explored the use of time-based approaches in CMP coherence protocols. This paper describes a time-based coherence framework for GPUs, called Temporal Coherence (TC), that exploits globally synchronized counters in single-chip systems to develop a streamlined GPU coherence protocol. Synchronized counters enable all coherence transitions, such as invalidation of cache blocks, to happen synchronously, eliminating all coherence traffic and protocol races. We present an implementation of TC, called TC-Weak, which eliminates LCC's trade-off between stalling stores and increasing L1 miss rates to improve performance and reduce interconnect traffic. By providing coherent L1 caches, TC-Weak improves the performance of GPU applications with inter-workgroup communication by 85% over disabling the non-coherent L1 caches in the baseline GPU. We also find that write-through protocols outperform a writeback protocol on a GPU as the latter suffers from increased traffic due to unnecessary refills of write-once data.

[1]  Andrew B. Kahng,et al.  ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[2]  William J. Dally,et al.  GPUs and the Future of Parallel Computing , 2011, IEEE Micro.

[3]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[4]  Christoforos E. Kozyrakis,et al.  SCD: A scalable coherence directory with flexible sharer set encoding , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[5]  Wen-Hann Wang,et al.  On the inclusion properties for multi-level cache hierarchies , 1988, ISCA '88.

[6]  Jonathan Chang,et al.  A 45 nm 8-Core Enterprise Xeon¯ Processor , 2010, IEEE J. Solid State Circuits.

[7]  Michel Dubois,et al.  Verification techniques for cache coherence protocols , 1997, CSUR.

[8]  David L Weaver,et al.  The SPARC architecture manual : version 9 , 1994 .

[9]  Srinivas Devadas,et al.  Library Cache Coherence , 2011 .

[10]  Keshav Pingali,et al.  A GPU implementation of inclusion-based points-to analysis , 2012, PPoPP '12.

[11]  Andreas Moshovos,et al.  Demystifying GPU microarchitecture through microbenchmarking , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).

[12]  Robert Sims,et al.  Alpha architecture reference manual , 1992 .

[13]  Mike O'Connor,et al.  Characterizing and evaluating a key-value store application on heterogeneous CPU-GPU systems , 2012, 2012 IEEE International Symposium on Performance Analysis of Systems & Software.

[14]  Sanjay J. Patel,et al.  WAYPOINT: scaling coherence to thousand-core architectures , 2010, PACT '10.

[15]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[16]  D. Lenoski,et al.  The SGI Origin: A ccnuma Highly Scalable Server , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[17]  M. Hill,et al.  Weak ordering-a new definition , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[18]  Sarita V. Adve,et al.  Shared Memory Consistency Models: A Tutorial , 1996, Computer.

[19]  Andrew S. Grimshaw,et al.  Scalable GPU graph traversal , 2012, PPoPP '12.

[20]  Sarita V. Adve,et al.  DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[21]  Jeremy Manson,et al.  The Java memory model , 2005, POPL '05.

[22]  Vijayalakshmi Srinivasan,et al.  SPATL: Honey, I Shrunk the Coherence Directory , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[23]  Jonathan Chang,et al.  A 45 nm 8-Core Enterprise Xeon¯ Processor , 2009, IEEE Journal of Solid-State Circuits.

[24]  Philippas Tsigas,et al.  On dynamic load balancing on graphics processors , 2008, GH '08.

[25]  Hans-Juergen Boehm,et al.  Foundations of the C++ concurrency memory model , 2008, PLDI '08.

[26]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[27]  Kenneth B. Kent,et al.  The VTR project: architecture and CAD for FPGAs from verilog to routing , 2012, FPGA '12.

[28]  Kunle Olukotun,et al.  Accelerating CUDA graph algorithms at maximum warp , 2011, PPoPP '11.

[29]  Srinivas Devadas,et al.  Memory coherence in the age of multicores , 2011, 2011 IEEE 29th International Conference on Computer Design (ICCD).

[30]  Alaa R. Alameldeen,et al.  Timestamp snooping: an approach for extending SMPs , 2000, SIGP.

[31]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[32]  John Giacomoni,et al.  FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue , 2008, PPoPP.

[33]  Sang Lyul Min,et al.  Design and Analysis of a Scalable Cache Coherence Scheme Based on Clocks and Timestamps , 1992, IEEE Trans. Parallel Distributed Syst..

[34]  Rami G. Melhem,et al.  A timestamp-based selective invalidation scheme for multiprocessor cache coherence , 1996, Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing.

[35]  Anoop Gupta,et al.  Memory consistency and event ordering in scalable shared-memory multiprocessors , 1990, ISCA '90.

[36]  Milo M. K. Martin,et al.  Token Coherence: decoupling performance and correctness , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[37]  Kunle Olukotun,et al.  Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.

[38]  Eric M. Schwarz,et al.  IBM POWER6 microarchitecture , 2007, IBM J. Res. Dev..

[39]  David A. Wood,et al.  A Primer on Memory Consistency and Cache Coherence , 2012, Synthesis Lectures on Computer Architecture.

[40]  David Seal,et al.  ARM Architecture Reference Manual , 2001 .

[41]  John R. Heath,et al.  Coherency Hub Design for Multisocket Sun Servers with CoolThreads Technology , 2009, IEEE Micro.

[42]  Sanjay J. Patel,et al.  Cohesion: a hybrid memory model for accelerators , 2010, ISCA.

[43]  Stefanos Kaxiras,et al.  Complexity-effective multicore coherence , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[44]  Cathy May,et al.  The PowerPC Architecture: A Specification for a New Family of RISC Processors , 1994 .

[45]  Keshav Pingali,et al.  An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm , 2011 .

[46]  Wu-chun Feng,et al.  Inter-block GPU communication via fast barrier synchronization , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[47]  P. J. Narayanan,et al.  CUDA cuts: Fast graph cuts on the GPU , 2008, 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[48]  Niraj K. Jha,et al.  GARNET: A detailed on-chip network model inside a full-system simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[49]  Milo M. K. Martin,et al.  Why on-chip cache coherence is here to stay , 2012, Commun. ACM.

[50]  David A. Wood,et al.  Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[51]  Pat Conway,et al.  The AMD Opteron Northbridge Architecture , 2007, IEEE Micro.