SPATL: Honey, I Shrunk the Coherence Directory

One of the key scalability challenges of on-chip coherence in a multicore chip is the coherence directory, which provides information on sharing of cache blocks. Shadow tags that duplicate entire private cache tag arrays are widely used to minimize area overhead, but require an energy-intensive associative search to obtain the sharing information. Recent research proposed a Tagless directory, which uses bloom filters to summarize the tags in a cache set. The Tagless directory associates the sharing vector with the bloom filter buckets to completely eliminate the associative lookup and reduce the directory overhead. However, Tagless still uses a full map sharing vector to represent the sharing information, resulting in remaining area and energy challenges with increasing core counts. In this paper, we first show that due to the regular nature of applications, many bloom filters essentially replicate the same sharing pattern. We next exploit the pattern commonality and propose SPATL (Sharing-pattern based Tagless Directory). SPATL exploits the sharing pattern commonality to decouple the sharing patterns from the bloom filters and eliminates the redundant copies of sharing patterns. SPATL works with both inclusive and noninclusive shared caches and provides 34% storage savings over Tagless, the previous most storage-efficient directory, at 16 cores. We study multiple strategies to periodically eliminate the false sharing that comes from combining sharing pattern compression with Tagless, and demonstrate that SPATL can achieve the same level of false sharers as Tagless with 5% extra bandwidth. Finally, we demonstrate that SPATL scales even better than an idealized directory and can support 1024-core chips with less than 1% of the private cache space for data parallel applications.

[1]  Sandhya Dwarkadas,et al.  SPACE: Sharing pattern-based directory coherence for multicore scalability , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[2]  Milo M. K. Martin,et al.  Simulating a $ 2 M Commercial Server on a $ 2 K PC T , 2001 .

[3]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[4]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[5]  Vijayalakshmi Srinivasan,et al.  A Tagless Coherence Directory , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[6]  Sanjay J. Patel,et al.  WAYPOINT: scaling coherence to thousand-core architectures , 2010, PACT '10.

[7]  Norman P. Jouppi,et al.  Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[8]  A. Richard Newton,et al.  An empirical evaluation of two memory-efficient directory methods , 1990, ISCA '90.

[9]  Mark Horowitz,et al.  An evaluation of directory schemes for cache coherence , 1998, ISCA '98.

[10]  Gu-Yeon Wei,et al.  Process Variation Tolerant 3T1D-Based Cache Architectures , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[11]  Babak Falsafi,et al.  Cuckoo directory: A scalable directory for many-core systems , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[12]  Paul Feautrier,et al.  A New Solution to Coherence Problems in Multicache Systems , 1978, IEEE Transactions on Computers.

[13]  Jr. Richard Thomas Simoni,et al.  Cache coherence directories for scalable multiprocessors , 1992 .

[14]  José González,et al.  A two-level directory architecture for highly scalable cc-NUMA multiprocessors , 2005, IEEE Transactions on Parallel and Distributed Systems.

[15]  Anoop Gupta,et al.  Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes , 1990, ICPP.

[16]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[17]  D. Lenoski,et al.  The SGI Origin: A ccnuma Highly Scalable Server , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[18]  Christian Bienia,et al.  Benchmarking modern multiprocessors , 2011 .

[19]  Kyu Ho Park,et al.  Segment directory enhancing the limited directory cache coherence schemes , 1999, Proceedings 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing. IPPS/SPDP 1999.