Efficient, snoopless, System-on-Chip coherence

Coherence in a System-on-Chip (SoC) introduces complexity and overhead (snooping caches or a directory, state bits, invalidations, etc.) in exchange for a clean and uniform shared memory model. A typical SoC today comprises a variety of cores with local caches, accelerators with local memories, and some form of shared last-level cache (LLC), all interconnected by shared buses. We propose a very simple coherence protocol for this environment that eliminates L1 snooping and its associated complexity and power cost. In essence, we remove all coherence decisions from the local caches by determining at the LLC whether data are private or shared. This makes a write-through policy a practical and effective way to maintain coherence: in the local caches, we dynamically select between write-back for private data and write-through for shared data. Self-invalidation of shared data at synchronization points eliminates the need to snoop, requiring only a data-race-free guarantee from software. Our evaluation shows that this simple protocol outperforms a traditional snooping protocol while significantly reducing L1, shared cache, and bus energy consumption.
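
The mechanism described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the protocol as specified in the paper: the class and method names (LLC, L1, classify, demote, sync) are hypothetical, lines are classified by first-touch at the LLC, and the one-time flush of the first owner on a private-to-shared transition is an assumption, since the abstract does not say how that transition is handled. Evictions, timing, and energy are not modeled.

```python
class LLC:
    """Shared last-level cache that classifies each address as private or shared."""

    def __init__(self):
        self.data = {}        # addr -> value
        self.first = {}       # addr -> first L1 that touched it (assumed first-touch rule)
        self.shared = set()   # addrs touched by more than one L1

    def classify(self, l1, addr):
        owner = self.first.setdefault(addr, l1)
        if owner is not l1 and addr not in self.shared:
            self.shared.add(addr)
            # Assumption: on the private -> shared transition the first owner
            # flushes its dirty copy once and switches the line to write-through.
            owner.demote(addr)
        return addr in self.shared

    def read(self, l1, addr):
        shared = self.classify(l1, addr)   # may flush the owner before we read
        return self.data.get(addr, 0), shared

    def write(self, addr, value):
        self.data[addr] = value


class L1:
    """Local cache with no snooping: it only follows the LLC's classification."""

    def __init__(self, llc):
        self.llc = llc
        self.lines = {}       # addr -> [value, is_shared, is_dirty]

    def load(self, addr):
        if addr not in self.lines:
            value, shared = self.llc.read(self, addr)
            self.lines[addr] = [value, shared, False]
        return self.lines[addr][0]

    def store(self, addr, value):
        self.load(addr)                    # write-allocate: fetch line and its class
        line = self.lines[addr]
        line[0] = value
        if line[1]:
            self.llc.write(addr, value)    # shared data: write-through
        else:
            line[2] = True                 # private data: write-back, keep dirty

    def demote(self, addr):
        line = self.lines.get(addr)
        if line is not None:
            if line[2]:
                self.llc.write(addr, line[0])
            line[1], line[2] = True, False

    def sync(self):
        # Synchronization point: self-invalidate shared lines, keep private ones.
        self.lines = {a: l for a, l in self.lines.items() if not l[1]}
```

In this model, a core that reads shared data may see a stale local copy until it reaches its next synchronization point, which is why the scheme relies on data-race-free software: correctly synchronized programs only need the new value after calling sync().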
