Directory Lookaside Table: Enabling scalable, low-conflict, many-core cache coherence directory

Maintaining hardware cache coherence on future CMPs becomes increasingly important and difficult as the number of cores keeps accelerating in mainstream multicore chips. The simple snooping-bus coherence scheme is not suitable due to its limited scalability. The sparse coherence directory approach may incur extra cache invalidations due to a topological mismatch between the coherence directory and the directories of all cache modules. In this paper, we propose an innovative CMP coherence directory that has three important properties. First, the directory has a simple set-associative design with small associativity. The number of directory entries matches the total number of cache blocks. Second, an augmented Directory Lookaside Table (DLT) allows blocks to be displaced from their primary sets in the coherence directory for alleviating hot-set conflicts. Third, to avoid expensive presence bits, each copy of a block along with the located core ID occupies a separate directory entry. Performance evaluations based on multithreaded and multi-programmed workloads demonstrate significant advantages of the proposed CMP directory over directories with traditional set-associative or skewed associative designs.

[1]  Anoop Gupta,et al.  Parallel computer architecture - a hardware / software approach , 1998 .

[2]  Vijayalakshmi Srinivasan,et al.  SPATL: Honey, I Shrunk the Coherence Directory , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[3]  Gitae Jeong,et al.  What Lies Ahead for Resistance-Based Memory Technologies? , 2013, Computer.

[4]  Vijayalakshmi Srinivasan,et al.  A Tagless Coherence Directory , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[5]  Mark D. Hill,et al.  Virtual hierarchies to support server consolidation , 2007, ISCA '07.

[6]  Paul Barford,et al.  Generating representative Web workloads for network and server performance evaluation , 1998, SIGMETRICS '98/PERFORMANCE '98.

[7]  Babak Falsafi,et al.  Cuckoo directory: A scalable directory for many-core systems , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[8]  A. Richard Newton,et al.  An empirical evaluation of two memory-efficient directory methods , 1990, ISCA '90.

[9]  Paul Feautrier,et al.  A New Solution to Coherence Problems in Multicache Systems , 1978, IEEE Transactions on Computers.

[10]  Laxmi N. Bhuyan,et al.  Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors , 1992, IEEE Trans. Parallel Distributed Syst..

[11]  Christoforos E. Kozyrakis,et al.  SCD: A scalable coherence directory with flexible sharer set encoding , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[12]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.

[13]  Milo M. K. Martin,et al.  Why on-chip cache coherence is here to stay , 2012, Commun. ACM.

[14]  Dean M. Tullsen,et al.  Proximity-aware directory-based coherence for multi-core processor architectures , 2007, SPAA '07.

[15]  José González,et al.  A two-level directory architecture for highly scalable cc-NUMA multiprocessors , 2005, IEEE Transactions on Parallel and Distributed Systems.

[16]  David A. Wood,et al.  Managing Wire Delay in Large Chip-Multiprocessor Caches , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[17]  Luiz André Barroso,et al.  Piranha: a scalable architecture based on single-chip multiprocessing , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[18]  Jichuan Chang,et al.  Cooperative Caching for Chip Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[19]  Basilio B. Fraguela,et al.  Adaptive line placement with the set balancing cache , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[20]  André Seznec,et al.  A case for two-way skewed-associative caches , 1993, ISCA '93.

[21]  Christoforos E. Kozyrakis,et al.  The ZCache: Decoupling Ways and Associativity , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[22]  François Bodin,et al.  Skewed Associativity Improves Program Performance and Enhances Predictability , 1997, IEEE Trans. Computers.

[23]  Rasmus Pagh,et al.  Cuckoo Hashing , 2001, Encyclopedia of Algorithms.

[24]  Suleyman Sair,et al.  Memory Behavior of the SPEC 2000 Benchmark Suite , 2000 .

[25]  Anoop Gupta,et al.  Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes , 1990, ICPP.

[26]  Calvin K. Tang Cache system design in the tightly coupled multiprocessor system , 1976, AFIPS '76.

[27]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[28]  Zeshan Chishti,et al.  Optimizing replication, communication, and capacity allocation in CMPs , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[29]  Bernard Chazelle,et al.  The Bloomier filter: an efficient data structure for static support lookup tables , 2004, SODA '04.

[30]  Anant Agarwal,et al.  Directory-based cache coherence in large-scale multiprocessors , 1990, Computer.