Paper information - Complexity-effective multicore coherence

Complexity-effective multicore coherence

Much of the complexity and overhead (directory, state bits, invalidations) of a typical directory coherence implementation stems from the effort to make it “invisible” even to the strongest memory consistency model. In this paper, we show that a much simpler, directory-less/broadcast-less, multicore coherence can outperform a directory protocol but without its complexity and overhead. Motivated by recent efforts to simplify coherence, we propose a hardware approach that does not require any application guidance. The cornerstone of our approach is a dynamic, application-transparent, write-policy (write-back for private data, write-through for shared data), simplifying the protocol to just two stable states. Self-invalidation of the shared data at synchronization points allows us to remove the directory (and invalidations) completely, with just a data-race-free guarantee from software. This leads to our main result: a virtually costless coherence that outperforms a MESI directory protocol (by 4.8%) while at the same time reducing shared cache and network energy consumption (by 14.2%) for 15 parallel benchmarks, on 16 cores.
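The mechanism summarized in the abstract can be illustrated with a small toy model. The sketch below is not the authors' implementation. It assumes a per-core private cache whose lines are either present (Valid) or absent (Invalid), a shared last-level cache modelled as a plain dictionary, and a hypothetical `shared_addrs` set standing in for the private/shared classification. The abstract calls that classification dynamic, whereas here it is fixed for simplicity. The names `L1Cache`, `llc`, `shared_addrs`, `sync`, and `evict` are all illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class L1Cache:
    """Toy private cache with two stable states: Valid (present) or Invalid (absent)."""
    llc: dict            # shared last-level cache, modelled as address -> value
    shared_addrs: set    # hypothetical classification oracle (assumption, static here)
    lines: dict = field(default_factory=dict)   # address -> value for Valid lines
    dirty: set = field(default_factory=set)     # private lines awaiting write-back

    def read(self, addr):
        # On a miss, fetch from the shared cache; no directory lookup is involved.
        if addr not in self.lines:
            self.lines[addr] = self.llc.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        if addr in self.shared_addrs:
            self.llc[addr] = value    # shared data: write-through
        else:
            self.dirty.add(addr)      # private data: write-back on eviction

    def evict(self, addr):
        # Write a dirty private line back to the shared cache, then drop it.
        if addr in self.dirty:
            self.llc[addr] = self.lines[addr]
            self.dirty.discard(addr)
        self.lines.pop(addr, None)

    def sync(self):
        # Synchronization point: self-invalidate shared lines so that the
        # next read refetches the latest written-through value.
        for addr in [a for a in self.lines if a in self.shared_addrs]:
            del self.lines[addr]
```

In this model a core may read a stale copy of shared data until it reaches a synchronization point, which is why the scheme relies on data-race-free software: correctly synchronized programs only observe shared values after `sync()`, when the stale copies are gone.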

Stefanos Kaxiras | Alberto Ros
