论文信息 - Are distributed sharing codes a solution to the scalability problem of coherence directories in manycores? An evaluation study

Are distributed sharing codes a solution to the scalability problem of coherence directories in manycores? An evaluation study

The development of efficient and scalable cache coherence protocols is a key aspect in the design of manycore chip multiprocessors. In this work, we present a comprehensive evaluation of a kind of cache coherence protocols that, despite having been already implemented during the 1990s for building large-scale commodity multiprocessors, have not been considered in the context of chip multiprocessors yet. In particular, we evaluate two directory-based cache coherence protocols based on the idea of having the sharing code of each memory block distributed between the different sharers (distributed sharing code). The first one employs simply-linked lists to encode the information about the sharers of the memory blocks, whilst the second one does the same using doubly-linked lists, which improves the management of replacements. We compare these two organizations with three protocols that use centralized sharing codes, each one having different directory memory overhead: one of them implementing a non-scalable bit-vector sharing code and the other two implementing more scalable limited-pointer schemes with one and two pointers, respectively. Simulation results show that for large-scale chip multiprocessors, the protocol based on distributed doubly-linked lists dramatically reduces the memory overhead of a non-scalable bit-vector directory, while at the same time it achieves its performance levels. This is achieved with just a small degradation on dynamic energy consumption (approximately 10 % on average). This way, our results point out that for manycores, coherence directories based on distributed sharing codes are appealing alternatives to contemporary coherence directories based on centralized sharing codes.

Alberto Ros | Manuel E. Acacio | Ricardo Fernández Pascual | Alberto Ros | M. Acacio

[1] Jung Ho Ahn,et al. How to simulate 1000 cores , 2009, CARN.

[2] David A. Wood,et al. Variability in architectural simulations of multi-threaded workloads , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[3] Milo M. K. Martin,et al. Why on-chip cache coherence is here to stay , 2012, Commun. ACM.

[4] B. Delagi,et al. Distributed-directory scheme: Stanford distributed-directory protocol , 1990, Computer.

[5] N. Gura,et al. UltraSPARC T2: A highly-treaded, power-efficient, SPARC SOC , 2007, 2007 IEEE Asian Solid-State Circuits Conference.

[6] Anant Agarwal,et al. LimitLESS directories: A scalable cache coherence scheme , 1991, ASPLOS IV.

[7] Balaram Sinharoy,et al. POWER7: IBM's next generation server processor , 2010, 2009 IEEE Hot Chips 21 Symposium (HCS).

[8] Valentin Puente,et al. SICOSYS: an integrated framework for studying interconnection network performance in multiprocessor systems , 2002, Proceedings 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing.

[9] Milo M. K. Martin,et al. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[10] Alberto Ros,et al. Scalable Directory Organization for Tiled CMP Architectures , 2008, CDES.

[11] Stein Gjessing,et al. Distributed-directory scheme: scalable coherent interface , 1990, Computer.

[12] David B. Gustavson,et al. Scalable Coherent Interface , 1990, COMPEURO'90: Proceedings of the 1990 IEEE International Conference on Computer Systems and Software Engineering@m_Systems Engineering Aspects of Complex Computerized Systems.

[13] Paul Feautrier,et al. A New Solution to Coherence Problems in Multicache Systems , 1978, IEEE Transactions on Computers.

[14] John L. Hennessy,et al. An evaluation of a commercial CC-NUMA architecture-the CONVEX Exemplar SPP1200 , 1997, Proceedings 11th International Parallel Processing Symposium.

[15] Harish Patil,et al. Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[16] Jung Ho Ahn,et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[17] José González,et al. A two-level directory architecture for highly scalable cc-NUMA multiprocessors , 2005, IEEE Transactions on Parallel and Distributed Systems.

[18] Anoop Gupta,et al. The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[19] Anoop Gupta,et al. Parallel computer architecture - a hardware / software approach , 1998 .

[20] Arnab Banerjee,et al. An Energy and Performance Exploration of Network-on-Chip Architectures , 2009, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[21] Michael Zhang,et al. Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors , 2005, ISCA 2005.

[22] Krste Asanovic,et al. Victim replication: maximizing capacity while hiding wire delay in tiled chip multiprocessors , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[23] Pat Conway,et al. Blade computing with the AMD Opteron™ processor ("magny-cours") , 2009, 2009 IEEE Hot Chips 21 Symposium (HCS).

[24] Stéphan Jourdan,et al. Haswell: The Fourth-Generation Intel Core Processor , 2014, IEEE Micro.

[25] Babak Falsafi,et al. Cuckoo directory: A scalable directory for many-core systems , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[26] Henry Hoffmann,et al. The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs , 2002, IEEE Micro.

[27] Kai Li,et al. The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[28] David B. Gustavson. The Scalable Coherent Interface and related standards projects , 1992, IEEE Micro.

[29] Balaram Sinharoy,et al. POWER4 system microarchitecture , 2002, IBM J. Res. Dev..

[30] Alberto Ros,et al. A scalable organization for distributed directories , 2010, J. Syst. Archit..

[31] Coniferous softwood. GENERAL TERMS , 2003 .

[32] T. Lovett,et al. STiNG: A CC-NUMA Computer System for the Commercial Marketplace , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[33] Alberto Ros,et al. Characterization of a List-Based Directory Cache Coherence Protocol for Manycore CMPs , 2014, Euro-Par Workshops.