Paper Information - Neat: Low-Complexity, Efficient On-Chip Cache Coherence

Cache coherence protocols such as MESI that use writer-initiated invalidation have high complexity, and they sometimes have poor performance and energy usage, especially under false sharing. Such protocols require numerous transient states, a shared directory, and support for core-to-core communication. An alternative to MESI’s writer-initiated invalidation is self-invalidation, which achieves lower complexity than MESI but either adds high performance costs or relies on programmer annotations or specific data access patterns. This paper presents Neat, a low-complexity, efficient cache coherence protocol. Neat uses self-invalidation, thus avoiding MESI’s transient states, directory, and core-to-core communication requirements. Neat uses novel mechanisms that effectively avoid many unnecessary self-invalidations. An evaluation shows that Neat is simple and has lower verification complexity than the MESI protocol. Neat not only outperforms state-of-the-art self-invalidation protocols, but its performance and energy consumption are comparable to MESI’s, and it outperforms MESI under false sharing.

Michael D. Bond | Brandon Lucia | Vignesh Balaji | Rui Zhang | Swarnendu Biswas
