Neat: Low-Complexity, Efficient On-Chip Cache Coherence

Cache coherence protocols such as MESI that use writer-initiated invalidation have high complexity—and sometimes have poor performance and energy usage, especially under false sharing. Such protocols require numerous transient states, a shared directory, and support for core-to-core communication, while also suffering under false sharing. An alternative to MESI’s writer-initiated invalidation is self-invalidation, which achieves lower complexity than MESI but adds high performance costs or relies on programmer annotations or specific data access patterns. This paper presents Neat, a low-complexity, efficient cache coherence protocol. Neat uses self-invalidation, thus avoiding MESI’s transient states, directory, and core-to-core communication requirements. Neat uses novel mechanisms that effectively avoid many unnecessary self-invalidations. An evaluation shows that Neat is simple and has lower verification complexity than the MESI protocol. Neat not only outperforms state-of-theart self-invalidation protocols, but its performance and energy consumption are comparable to MESI’s, and it outperforms MESI under false sharing.

[1]  Sarita V. Adve,et al.  Revisiting the Complexity of Hardware Cache Coherence and Some Implications , 2014, ACM Trans. Archit. Code Optim..

[2]  Sarita V. Adve,et al.  Parallel programming must be deterministic by default , 2009 .

[3]  Sarita V. Adve,et al.  DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations , 2015, ASPLOS.

[4]  Wenzhi Chen,et al.  Efficient Timestamp-Based Cache Coherence Protocol for Many-Core Architectures , 2016, ICS.

[5]  Sarita V. Adve,et al.  Efficient GPU synchronization without scopes: Saying no to complex consistency models , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[6]  Sarita V. Adve,et al.  Memory models: a case for rethinking parallel languages and hardware , 2009, PODC '09.

[7]  Jeffrey B. Rothman,et al.  Sector cache design and performance , 2000, Proceedings 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (Cat. No.PR00728).

[8]  Stefanos Kaxiras,et al.  Racer: TSO consistency via race detection , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[9]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[10]  Chen Tian,et al.  PREDATOR: predictive false sharing detection , 2014, PPoPP '14.

[11]  Andreas G. Veneris,et al.  L-CBF: A Low-Power, Fast Counting Bloom Filter Architecture , 2008, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[12]  David A. Wood,et al.  A Primer on Memory Consistency and Cache Coherence , 2012, Synthesis Lectures on Computer Architecture.

[13]  David A. Wood,et al.  Lazy release consistency for GPUs , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[14]  Sarita V. Adve,et al.  DeNovoND: efficient hardware support for disciplined non-determinism , 2013, ASPLOS '13.

[15]  Alan L. Cox,et al.  A comparison of entry consistency and lazy release consistency implementations , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[16]  Stefanos Kaxiras,et al.  Complexity-effective multicore coherence , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[17]  Hans-Juergen Boehm,et al.  Foundations of the C++ concurrency memory model , 2008, PLDI '08.

[18]  Christoforos E. Kozyrakis,et al.  Evaluating MapReduce for Multi-core and Multiprocessor Systems , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[19]  Swarnendu Biswas,et al.  Rethinking Support for Region Conflict Exceptions , 2019, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[20]  Alan L. Cox,et al.  Lazy release consistency for software distributed shared memory , 1992, ISCA '92.

[21]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[22]  Kai Li,et al.  Retrospective: virtual memory mapped network interface for the SHRIMP multicomputer , 1994, ISCA '98.

[23]  Sarita V. Adve,et al.  Spandex: A Flexible Interface for Efficient Heterogeneous Coherence , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).

[24]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[25]  Burton H. Bloom,et al.  Space/time trade-offs in hash coding with allowable errors , 1970, CACM.

[26]  David A. Wood,et al.  QuickRelease: A throughput-oriented approach to release consistency on GPUs , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[27]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[28]  Brandon Lucia,et al.  Conflict exceptions: simplifying concurrent language semantics with precise hardware exceptions for data-races , 2010, ISCA.

[29]  Janak H. Patel,et al.  A low-overhead coherence solution for multiprocessors with private cache memories , 1984, ISCA '84.

[30]  Stefanos Kaxiras,et al.  SARC Coherence: Scaling Directory Cache Coherence in Performance and Power , 2010, IEEE Micro.

[31]  Miguel Castro,et al.  Efficient and flexible object sharing , 1996, Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing.

[32]  Brian N. Bershad,et al.  Midway : shared memory parallel programming with entry consistency for distributed memory multiprocessors , 1991 .

[33]  Marcelo Cintra,et al.  An OS-based alternative to full hardware coherence on tiled CMPs , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[34]  Sarita V. Adve,et al.  DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[35]  Stefanos Kaxiras,et al.  Automatic detection of extended data-race-free regions , 2017, 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[36]  Tanvir Ahmed Khan,et al.  Huron: hybrid false sharing detection and repair , 2019, PLDI.

[37]  Stefanos Kaxiras,et al.  Callback: Efficient synchronization without invalidation with a directory just for spin-waiting , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[38]  Alan J. Hu,et al.  Protocol verification as a hardware design aid , 1992, Proceedings 1992 IEEE International Conference on Computer Design: VLSI in Computers & Processors.

[39]  Dan Grossman,et al.  RADISH: Always-on sound and complete race detection in software and hardware , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).