Paper information - Hierarchical private/shared classification: The key to simple and efficient coherence for clustered cache hierarchies

Hierarchical clustered cache designs are becoming an appealing alternative for multicores. Grouping cores and their caches in clusters reduces network congestion by localizing traffic among several hierarchical levels, potentially enabling much higher scalability. While such architectures can be formed recursively by replicating a base design pattern, keeping the whole hierarchy coherent requires more effort and consideration. The reason is that, in hierarchical coherence, even basic operations must be recursive. As a consequence, intermediate-level caches behave both as directories and as leaf caches. This leads to an explosion of states, protocol races, and protocol complexity. While there have been previous efforts to extend directory-based coherence to hierarchical designs, their increased complexity and verification cost are a serious impediment to their adoption. We aim to address these concerns by encapsulating all hierarchical complexity in a simple function: that of determining when a data block is shared entirely within a cluster (sub-tree of the hierarchy) and is private from the outside. This allows us to eliminate complex recursive operations that span the hierarchy and instead employ simple coherence mechanisms such as self-invalidation and write-through, now restricted to operate within the cluster where a data block is shared. We examine two inclusivity options and discuss the relation of our approach to the recently proposed Hierarchical-Race-Free (HRF) memory models. Finally, comparisons to hierarchical directory-based MOESI, VIPS-M, and TokenCMP protocols show that, despite its simplicity, our approach results in competitive performance and decreased network traffic.
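The classification function the abstract describes can be illustrated with a small sketch. It is not the authors' mechanism, only one way to read "shared entirely within a cluster and private from the outside": find the smallest sub-tree that contains every core accessing a block. The function name `sharing_cluster`, the uniform `fanout`, and the assumption that sibling cores have contiguous ids are all hypothetical choices made for this example.

```python
def sharing_cluster(core_ids, fanout):
    """Return (level, cluster_index) of the smallest cluster holding all accessors.

    Assumptions (illustrative, not from the paper):
      - the hierarchy is a uniform tree with `fanout` children per node;
      - cores are numbered so that siblings have contiguous ids;
      - level 0 is a single core, level k is a cluster of fanout**k cores.
    """
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    nodes = set(core_ids)
    if not nodes:
        raise ValueError("at least one accessing core is required")

    level = 0
    # Climb one level at a time: integer division by the fan-out maps a node
    # to its parent, so the set shrinks until all accessors share one ancestor.
    while len(nodes) > 1:
        nodes = {n // fanout for n in nodes}
        level += 1
    return level, nodes.pop()


if __name__ == "__main__":
    # 16 cores arranged as 4 clusters of 4 cores (fanout = 4).
    print(sharing_cluster([5], 4))        # block private to core 5
    print(sharing_cluster([5, 6, 7], 4))  # shared inside one 4-core cluster
    print(sharing_cluster([0, 4], 4))     # shared across clusters
```

Under this reading, a block whose result is `(1, 1)` would be treated as shared inside cluster 1 and as private at every level above it, so mechanisms such as self-invalidation and write-through would only need to act within that cluster.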

Stefanos Kaxiras | Alberto Ros | Mahdad Davari

[1] Babak Falsafi, et al. Reactive NUCA: near-optimal block placement and replication in distributed caches, 2009, ISCA '09.

[2] Sarita V. Adve, et al. DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism, 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[3] Thomas J. Ashby, et al. Software-Based Cache Coherence with Hardware-Assisted Selective Self-Invalidations Using Bloom Filters, 2011, IEEE Transactions on Computers.

[4] Charles E. Leiserson, et al. A consistency architecture for hierarchical shared caches, 2008, SPAA '08.

[5] Mohammad Alisafaee. Spatiotemporal Coherence Tracking, 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[6] David A. Wood, et al. Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors, 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[7] Niraj K. Jha, et al. GARNET: A detailed on-chip network model inside a full-system simulator, 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[8] Michael C. Huang, et al. POPS: Coherence Protocol Optimization for Both Private and Shared Data, 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[9] David A. Wood, et al. Variability in architectural simulations of multi-threaded workloads, 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.

[10] Milo M. K. Martin, et al. Why on-chip cache coherence is here to stay, 2012, Commun. ACM.

[11] James R. Larus, et al. Mechanisms for cooperative shared memory, 1993, ISCA '93.

[12] David B. Gustavson. The Scalable Coherent Interface and related standards projects, 1992, IEEE Micro.

[13] Mark D. Hill, et al. Virtual Hierarchies, 2008, IEEE Micro.

[14] Michael Butler, et al. Bulldozer: An Approach to Multithreaded Compute Performance, 2011, IEEE Micro.

[15] Anoop Gupta, et al. The Stanford Dash multiprocessor, 1992, Computer.

[16] Alan J. Hu, et al. Improving multiple-CMP systems using token coherence, 2005, 11th International Symposium on High-Performance Computer Architecture.

[17] Antonio Robles, et al. Temporal-Aware Mechanism to Detect Private Data in Chip Multiprocessors, 2013, 2013 42nd International Conference on Parallel Processing.

[18] Antonio Robles, et al. Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks, 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[19] Andrew W. Wilson, et al. Hierarchical cache/bus architecture for shared memory multiprocessors, 1987, ISCA '87.

[20] M. Hill, et al. Weak ordering - a new definition, 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[21] Babak Falsafi, et al. Multi-grain coherence directories, 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[22] David A. Wood, et al. A Primer on Memory Consistency and Cache Coherence, 2012, Synthesis Lectures on Computer Architecture.

[23] Li Shang, et al. In-Network Cache Coherence, 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[24] Anoop Gupta, et al. The SPLASH-2 programs: characterization and methodological considerations, 1995, ISCA.

[25] Jaehyuk Huh, et al. Subspace snooping: Filtering snoops with operating system support, 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[26] Stefanos Kaxiras, et al. SARC Coherence: Scaling Directory Cache Coherence in Performance and Power, 2010, IEEE Micro.

[27] William J. Dally, et al. The GPU Computing Era, 2010, IEEE Micro.

[28] N. Gura, et al. UltraSPARC T2: A highly-threaded, power-efficient, SPARC SOC, 2007, 2007 IEEE Asian Solid-State Circuits Conference.

[29] Kai Li, et al. The PARSEC benchmark suite: Characterization and architectural implications, 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[30] Natalie D. Enright Jerger, et al. Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence, 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[31] Stefanos Kaxiras, et al. Complexity-effective multicore coherence, 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[32] David A. Wood, et al. Heterogeneous-race-free memory models, 2014, ASPLOS.

[33] Dhiraj K. Pradhan, et al. Two economical directory schemes for large-scale cache coherent multiprocessors, 1991, CARN.

[34] Seth H. Pugsley, et al. SWEL: Hardware cache coherence protocols to map shared data onto shared caches, 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[35] Babak Falsafi, et al. Cuckoo directory: A scalable directory for many-core systems, 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[36] Meng Zhang, et al. Fractal Coherence: Scalably Verifiable Cache Coherence, 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[37] Stefanos Kaxiras, et al. A new perspective for efficient virtual-cache coherence, 2013, ISCA.

[38] Sanjay J. Patel, et al. Rigel: an architecture and scalable programming interface for a 1000-core accelerator, 2009, ISCA '09.

[39] Milo M. K. Martin, et al. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset, 2005, CARN.

[40] Mark D. Hill, et al. Virtual hierarchies to support server consolidation, 2007, ISCA '07.

[41] Per Stenström, et al. The Scalable Tree Protocol - a cache coherence approach for large-scale multiprocessors, 1992, [1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing.

[42] Stefanos Kaxiras, et al. The GLOW cache coherence protocol extensions for widely shared data, 1996, ICS '96.

[43] Milo M. K. Martin, et al. Token Coherence: decoupling performance and correctness, 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings.

[44] Sarita V. Adve, et al. DeNovoND: efficient hardware support for disciplined non-determinism, 2013, ASPLOS '13.