A composite and scalable cache coherence protocol for large scale CMPs

The number of on-chip cores of modern chip multiprocessors (CMPs) is growing fast with technology scaling. However, it remains a big challenge to efficiently support cache coherence for large scale CMPs. The conventional snoopy and directory coherence protocols cannot be smoothly scaled to many-core or thousand-core processors. Snoopy protocols introduce large power overhead due to enormous amount of cache tag probing triggered by broadcast. Directory protocols introduce performance penalty due to indirection, and large storage overhead due to storing directories. This paper addresses the efficiency problem when supporting cache coherency for large-scale CMPs. By leveraging emerging optical on-chip interconnect (OP-I) technology to provide high bandwidth density, low propagation delay and natural support for multicast/broadcast in a hierarchical network organization, we propose a composite cache coherence (C3) protocol that benefits from direct cache-to-cache accesses as in snoopy protocol and small amount of cache probing as in directory protocol. Targeting at quickly completing coherence transactions, C3 organizes accesses in a three-tier hierarchy by combining a mix of designs including local broadcast prediction, filtering, and a coarse-grained directory. Compared to directory-based protocol[18], our evaluations on a thousand-core CMP show that C3 improves performance by 21%, reduces network latency of coherence messages by 41% and saves network energy consumption by 5.5% on average for PARSEC applications.

[1]  Niraj K. Jha,et al.  In-Network Coherence Filtering: Snoopy coherence without broadcasts , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[2]  Christopher Batten,et al.  Silicon-photonic clos networks for global on-chip communication , 2009, 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip.

[3]  D.A.B. Miller,et al.  Rationale and challenges for optical interconnects to electronic chips , 2000, Proceedings of the IEEE.

[4]  George Kurian,et al.  ATAC: A 1000-core cache-coherent processor with on-chip optical network , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[5]  Kunle Olukotun,et al.  The case for a single-chip multiprocessor , 1996, ASPLOS VII.

[6]  Saurabh Dighe,et al.  An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[7]  Christopher Batten,et al.  Re-architecting DRAM memory systems with monolithically integrated silicon photonics , 2010, ISCA.

[8]  David H. Albonesi,et al.  Phastlane: a rapid transit optical routing network , 2009, ISCA '09.

[9]  Li Shang,et al.  In-Network Cache Coherence , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[10]  Marc Tremblay,et al.  Rock: A High-Performance Sparc CMT Processor , 2009, IEEE Micro.

[11]  Sanjay J. Patel,et al.  WAYPOINT: scaling coherence to thousand-core architectures , 2010, PACT '10.

[12]  Paul Feautrier,et al.  A New Solution to Coherence Problems in Multicache Systems , 1978, IEEE Transactions on Computers.

[13]  Niraj K. Jha,et al.  In-Network Snoop Ordering (INSO): Snoopy coherence on unordered interconnects , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[14]  Natalie D. Enright Jerger,et al.  Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support , 2008, 2008 International Symposium on Computer Architecture.

[15]  Babak Falsafi,et al.  JETTY: filtering snoops for reduced energy consumption in SMP servers , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[16]  Yu Zhang,et al.  Firefly: illuminating future network-on-chip with nanophotonics , 2009, ISCA '09.

[17]  William J. Dally,et al.  Flattened Butterfly Topology for On-Chip Networks , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[18]  William J. Dally,et al.  Principles and Practices of Interconnection Networks , 2004 .

[19]  Jung Ho Ahn,et al.  Corona: System Implications of Emerging Nanophotonic Technology , 2008, 2008 International Symposium on Computer Architecture.

[20]  Coniferous softwood GENERAL TERMS , 2003 .

[21]  Christopher Batten,et al.  Building Manycore Processor-to-DRAM Networks with Monolithic Silicon Photonics , 2008, 2008 16th IEEE Symposium on High Performance Interconnects.

[22]  Alyssa B. Apsel,et al.  Leveraging Optical Technology in Future Bus-based Chip Multiprocessors , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[23]  Anoop Gupta,et al.  Design of scalable shared-memory multiprocessors: the DASH approach , 1990, Digest of Papers Compcon Spring '90. Thirty-Fifth IEEE Computer Society International Conference on Intellectual Leverage.

[24]  Krisztián Flautner,et al.  PicoServer: using 3D stacking technology to enable a compact energy efficient chip multiprocessor , 2006, ASPLOS XII.

[25]  Vijayalakshmi Srinivasan,et al.  A Tagless Coherence Directory , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[26]  Timothy Mark Pinkston,et al.  A hybrid cache coherence protocol for a decoupled multi-channel optical network: SPEED DMON , 1996, Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing.

[27]  Anoop Gupta,et al.  Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes , 1990, ICPP.

[28]  Peng Liu,et al.  An intra-chip free-space optical interconnect , 2010, ISCA.

[29]  Kunle Olukotun,et al.  A Single-Chip Multiprocessor , 1997, Computer.

[30]  José González,et al.  The use of prediction for accelerating upgrade misses in cc-NUMA multiprocessors , 2002, Proceedings.International Conference on Parallel Architectures and Compilation Techniques.

[31]  Nevin Kirman,et al.  A power-efficient all-optical on-chip interconnect using wavelength-based oblivious routing , 2010, ASPLOS 2010.

[32]  N. Binkert,et al.  Atomic Coherence: Leveraging nanophotonics to build race-free cache coherence protocols , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[33]  Rajeev Balasubramonian,et al.  Towards scalable, energy-efficient, bus-based on-chip networks , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[34]  Lei Jiang,et al.  Die Stacking (3D) Microarchitecture , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[35]  Josep Torrellas,et al.  Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[36]  John Kim,et al.  FlexiShare: Channel sharing for an energy-efficient nanophotonic crossbar , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[37]  Alan J. Hu,et al.  Improving multiple-CMP systems using token coherence , 2005, 11th International Symposium on High-Performance Computer Architecture.

[38]  Sanjay J. Patel,et al.  WayPoint: Scaling coherence to 1000-core architectures , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[39]  José González,et al.  Owner Prediction for Accelerating Cache-to-Cache Transfer Misses in a cc-NUMA Architecture , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[40]  Milo M. K. Martin,et al.  Token Coherence: decoupling performance and correctness , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[41]  Li Shang,et al.  Spectrum: A hybrid nanophotonic—electric on-chip network , 2009, 2009 46th ACM/IEEE Design Automation Conference.

[42]  Natalie D. Enright Jerger,et al.  Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[43]  William J. Dally,et al.  Design tradeoffs for tiled CMP on-chip networks , 2006, ICS '06.

[44]  Mark Horowitz,et al.  An evaluation of directory schemes for cache coherence , 1998, ISCA '98.

[45]  Aamer Jaleel,et al.  High performance cache replacement using re-reference interval prediction (RRIP) , 2010, ISCA.

[46]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).