A scalable organization for distributed directories

Although directory-based cache-coherence protocols are the best choice when designing chip multiprocessors with tens of cores on-chip, the memory overhead introduced by the directory structure may not scale gracefully with the number of cores. Many approaches aimed at improving the scalability of directories have been proposed. However, they do not bring perfect scalability and usually reduce the directory memory overhead by compressing coherence information, which in turn results in extra unnecessary coherence messages and, therefore, wasted energy and some performance degradation. In this work, we present a distributed directory organization based on duplicate tags for tiled CMP architectures whose size is independent on the number of tiles of the system up to a certain number of tiles. We demonstrate that this number of tiles corresponds to the number of sets in the private caches. Additionally, we show that the area overhead of the proposed directory structure is 0.56% with respect to the on-chip data caches. Moreover, the proposed directory structure keeps the same information than a non-scalable full-map directory. Finally, we propose a mechanism that takes advantage of this directory organization to remove the network traffic caused by replacements. This mechanism reduces total traffic by 15% for a 16-core configuration compared to a traditional directory-based protocol.

[1]  Mark D. Hill,et al.  Virtual hierarchies to support server consolidation , 2007, ISCA '07.

[2]  Maged M. Michael,et al.  High-throughout coherence control and hardware messaging in Everest , 2001, IBM J. Res. Dev..

[3]  José González,et al.  A two-level directory architecture for highly scalable cc-NUMA multiprocessors , 2005, IEEE Transactions on Parallel and Distributed Systems.

[4]  Valentin Puente,et al.  SICOSYS: an integrated framework for studying interconnection network performance in multiprocessor systems , 2002, Proceedings 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing.

[5]  Michael Zhang,et al.  Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors , 2005, ISCA 2005.

[6]  Anoop Gupta,et al.  Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes , 1990, ICPP.

[7]  N. Gura,et al.  UltraSPARC T2: A highly-treaded, power-efficient, SPARC SOC , 2007, 2007 IEEE Asian Solid-State Circuits Conference.

[8]  Luiz André Barroso,et al.  Piranha: a scalable architecture based on single-chip multiprocessing , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[9]  Anant Agarwal,et al.  LimitLESS directories: A scalable cache coherence scheme , 1991, ASPLOS IV.

[10]  Mark D. Hill,et al.  An evaluation of directory protocols for medium-scale shared-memory multiprocessors , 1994, ICS '94.

[11]  Sharad Malik,et al.  Power-driven design of router microarchitectures in on-chip networks , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[12]  Yen-Kuang Chen,et al.  The ALPBench benchmark suite for complex multimedia applications , 2005, IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005..

[13]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[14]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[15]  A. Richard Newton,et al.  An empirical evaluation of two memory-efficient directory methods , 1990, ISCA '90.

[16]  Jinseok Kong,et al.  Minimizing the Directory Size for Large-Scale Shared-Memory Multiprocessors , 2005, IEICE Trans. Inf. Syst..

[17]  David Wentzlaff,et al.  Processor: A 64-Core SoC with Mesh Interconnect , 2010 .

[18]  Alberto Ros,et al.  A Novel Lightweight Directory Architecture for Scalable Shared-Memory Multiprocessors , 2005, Euro-Par.

[19]  Sangyeun Cho,et al.  Managing Distributed, Shared L2 Caches through OS-Level Page Allocation , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[20]  Mani Azimi,et al.  Integration Challenges and Tradeoffs for Terascale Architectures , 2007 .

[21]  Jaehyuk Huh,et al.  A NUCA Substrate for Flexible CMP Cache Sharing , 2007, IEEE Transactions on Parallel and Distributed Systems.

[22]  Alberto Ros,et al.  Scalable Directory Organization for Tiled CMP Architectures , 2008, CDES.

[23]  R. J. Joenk,et al.  IBM journal of research and development: information for authors , 1978 .

[24]  Eric M. Schwarz,et al.  IBM POWER6 microarchitecture , 2007, IBM J. Res. Dev..

[25]  José González,et al.  A new scalable directory architecture for large-scale multiprocessors , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[26]  Mark Horowitz,et al.  An evaluation of directory schemes for cache coherence , 1998, ISCA '98.

[27]  Akhilesh Kumar,et al.  Integration Challenges and Tradeoffs for Tera-scale Architectures I l ® chnology , 2007 .

[28]  Mark Horowitz Dynamic Pointer Allocation for Scalable Cache Coherence Directories , 1991 .

[29]  Uri C. Weiser,et al.  Interconnect-power dissipation in a microprocessor , 2004, SLIP '04.

[30]  Natalie D. Enright Jerger,et al.  Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[31]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[32]  Henry Hoffmann,et al.  The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs , 2002, IEEE Micro.

[33]  Jr. Richard Thomas Simoni,et al.  Cache coherence directories for scalable multiprocessors , 1992 .