TC-Release++: An Efficient Timestamp-Based Coherence Protocol for Many-Core Architectures

As we enter the many-core era, providing the shared memory abstraction through cache coherence has become progressively more difficult. Standard directory-based coherence does not scale well with increasing core count. Recently introduced timestamp-based hardware coherence protocols offer an attractive alternative. This paper proposes a timestamp-based coherence protocol, called TC-Release++, that efficiently supports cache coherence in large-scale systems. Our approach is inspired by TC-Weak, a recently proposed timestamp-based coherence protocol targeting GPU architectures. We first design TC-Release as a straightforward port of TC-Weak to general-purpose many-cores. However, repurposing TC-Weak for general-purpose many-core architectures is challenging because of significant differences in both the architecture and the programming model. Indeed, the performance of TC-Release turns out to be worse than that of conventional directory protocols. We overcome the limitations and overheads of TC-Release by exploiting simple hardware support to eliminate frequent memory stalls, and an optimized lifetime prediction mechanism to improve cache performance. The resulting optimized coherence protocol, TC-Release++, is highly scalable (storage scales logarithmically with core count) and shows better performance (by 3.0 percent) and comparable network traffic (within 1.3 percent) relative to the baseline MESI directory protocol. We use Murphi to formally verify that TC-Release++ is error-free and show that it imposes a small verification cost.
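To make the general idea of timestamp-based coherence concrete, the following is a minimal sketch of a lease-style scheme of the kind TC-Weak is usually described as using: each private copy carries an expiry timestamp and self-invalidates when it lapses, so no sharer list or invalidation messages are needed, and a writer only has to wait at a release until all outstanding leases on the lines it wrote have expired. This is an illustration under stated assumptions, not the TC-Release++ design; the names `SharedLine`, `PrivateCache`, `release_ready_at`, the fixed `lease` argument, and the write-through policy are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SharedLine:
    """State kept at the shared cache for one address (hypothetical layout)."""
    value: int = 0
    max_lease_expiry: int = 0  # latest expiry handed out to any reader


@dataclass
class PrivateCache:
    """One core's private cache; holds (value, expiry) pairs per address."""
    lines: dict = field(default_factory=dict)
    # Latest time at which all of this core's earlier writes are
    # guaranteed visible to every other core.
    pending_write_ts: int = 0

    def read(self, shared, addr, now, lease):
        cached = self.lines.get(addr)
        if cached is not None and now < cached[1]:
            return cached[0]  # hit: the lease has not expired yet
        # Miss or expired copy: refetch and take a fresh lease.
        line = shared.setdefault(addr, SharedLine())
        expiry = now + lease
        line.max_lease_expiry = max(line.max_lease_expiry, expiry)
        self.lines[addr] = (line.value, expiry)
        return line.value

    def write(self, shared, addr, value, now):
        # Assumed write-through: update the shared copy without stalling
        # and remember when the last stale private copy will expire.
        line = shared.setdefault(addr, SharedLine())
        line.value = value
        self.lines.pop(addr, None)
        self.pending_write_ts = max(self.pending_write_ts,
                                    line.max_lease_expiry)

    def release_ready_at(self, now):
        # A release may complete only once all earlier writes are visible.
        return max(now, self.pending_write_ts)
```

In this sketch a reader can keep seeing an old value until its lease expires, which is acceptable under a release-consistency-style model as long as the writer's release waits for that expiry. The per-line state is a timestamp rather than a bit per core, which is the intuition behind storage that does not grow linearly with core count. The choice of lease length is where a lifetime predictor would plug in: leases that are too short cause extra misses, and leases that are too long delay releases.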
