In-Network Snoop Ordering (INSO): Snoopy coherence on unordered interconnects

Realizing scalable cache coherence in the many-core era comes with a whole new set of constraints and opportunities. It is widely believed that multi-hop, unordered on-chip networks would be needed in many-core chip multiprocessors (CMPs) to provide scalable on-chip communication. However, providing ordering among coherence transactions on unordered interconnects is a challenge. Traditional approaches for tackling coherence either have to use ordered interconnects (snoop-based protocols) which lead to scalability problems, or rely on an ordering point (directory-based protocols) which adds indirection latency. In this paper, we propose In-Network Snoop Ordering (INSO), in which coherence requests from a snoop-based protocol are inserted into the interconnect fabric and the network orders the requests in a distributed manner, creating a global ordering among requests. Essentially, when coherence requests enter the network, they grab snoop-orders at the injection router before being broadcasted. A snoop-order specifies the global ordering of the particular request with respect to other requests. Before requests reach their destinations, they get ordered along the way, at intermediate routers and destination network interfaces. Our logical ordering scheme can be mapped onto any unordered interconnect. This enables a cache coherence protocol which exploits the low-latency nature of unordered interconnects without adding indirection to coherence transactions. Our full-system evaluations compare INSO against a directory protocol and a broadcast based Token Coherence protocol. INSO outperforms these protocols by up to 30% and 8.5%, respectively, on a wide range of scientific and emerging applications.

[1]  David J. Schanin The design and development of a very high speed system bus—the encore Mutlimax nanobus , 1986 .

[2]  David J. Schanin The Design and Development of a Very High Speed System Bus - The Encore Multimax Nanobus , 1986, FJCC.

[3]  Andreas Nowatzyk,et al.  Coherent shared memory on a message passing machine , 1988 .

[4]  William J. Dally,et al.  Virtual-channel flow control , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[5]  Cathy May,et al.  The PowerPC Architecture: A Specification for a New Family of RISC Processors , 1994 .

[6]  Eric Williams,et al.  Performance optimizations, implementation, and verification of the SGI Challenge multiprocessor , 1994, 1994 Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences.

[7]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[8]  Carl Staelin,et al.  lmbench: Portable Tools for Performance Analysis , 1996, USENIX Annual Technical Conference.

[9]  Sarita V. Adve,et al.  Shared Memory Consistency Models: A Tutorial , 1996, Computer.

[10]  Paul F. Reynolds,et al.  Isotach Networks , 1997, IEEE Trans. Parallel Distributed Syst..

[11]  Alan E. Charlesworth,et al.  Starfire: extending the SMP envelope , 1998, IEEE Micro.

[12]  David A. Wood,et al.  Multicast snooping: a new coherence method using a multicast address network , 1999, ISCA.

[13]  Bronis R. de Supinski,et al.  Delta coherence protocols , 2000, IEEE Concurr..

[14]  Alaa R. Alameldeen,et al.  Timestamp snooping: an approach for extending SMPs , 2000, SIGP.

[15]  W. Dally,et al.  Route packets, not wires: on-chip interconnection networks , 2001, Proceedings of the 38th Design Automation Conference (IEEE Cat. No.01CH37232).

[16]  A. Charlesworth The Sun Fireplane System Interconnect , 2001, ACM/IEEE SC 2001 Conference (SC'01).

[17]  Milo M. K. Martin,et al.  Specifying and Verifying a Broadcast and a Multicast Snooping Cache Coherence Protocol , 2002, IEEE Trans. Parallel Distributed Syst..

[18]  IEEE Transactions on Parallel and Distributed Systems, Vol. 13 , 2002 .

[19]  Milo M. K. Martin,et al.  Token Coherence: decoupling performance and correctness , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[20]  William J. Dally,et al.  Principles and Practices of Interconnection Networks , 2004 .

[21]  Jaehyuk Huh,et al.  TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP , 2004, TACO.

[22]  Balaram Sinharoy,et al.  POWER5 system microarchitecture , 2005, IBM J. Res. Dev..

[23]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[24]  Li Shang,et al.  In-Network Cache Coherence , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[25]  Mark D. Hill,et al.  Coherence Ordering for Ring-based Chip Multiprocessors , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[26]  David A. Wood,et al.  IPC Considered Harmful for Multiprocessor Workloads , 2006, IEEE Micro.

[27]  Henry Hoffmann,et al.  On-Chip Interconnection Architecture of the Tile Processor , 2007, IEEE Micro.

[28]  Saurabh Dighe,et al.  An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[29]  Josep Torrellas,et al.  Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[30]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[31]  Natalie D. Enright Jerger,et al.  Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support , 2008, 2008 International Symposium on Computer Architecture.

[32]  Niraj K. Jha,et al.  GARNET: A detailed on-chip network model inside a full-system simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.