An analysis of on-chip interconnection networks for large-scale chip multiprocessors

With the number of cores in chip multiprocessors (CMPs) growing rapidly as technology scales down, connecting the different components of a CMP in a scalable and efficient way becomes increasingly challenging. In this article, we explore the architectural-level implications of interconnection network design for CMPs with up to 128 fine-grain multithreaded cores. We evaluate and compare different network topologies using accurate simulation of the full chip, including the memory hierarchy and interconnect, and using a diverse set of scientific and engineering workloads. We find that the interconnect has a large impact on performance, as it is responsible for 60% to 75% of the miss latency. Latency, not bandwidth, is the primary performance constraint: even with many threads per core and workloads with high miss rates, networks with enough bandwidth can be efficiently implemented for the system scales we consider. Of the topologies we study, the flattened butterfly consistently outperforms the mesh and fat tree on all workloads, with performance advantages of up to 22%. We also show that considering the interconnect and memory hierarchy together when designing large-scale CMPs is crucial, and that neglecting either of the two can lead to incorrect conclusions. Finally, the effect of the interconnect on overall performance becomes more important as the number of cores increases, making interconnection choices especially critical when scaling up.
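One rough way to see why topology affects latency is to compare average hop counts. The sketch below is illustrative only and is not the full-chip simulation used in the article. It assumes a k x k grid of routers, minimal routing, and uniform random traffic, and it ignores concentration, router pipeline depth, and wire length. The function names and grid sizes are hypothetical choices for this example. It models the 2D flattened butterfly as a grid in which every router links directly to all routers in its row and column.

```python
from itertools import product


def average_hops(k, hop_fn):
    """Mean hop count over all ordered pairs of distinct routers in a k x k grid."""
    nodes = list(product(range(k), repeat=2))
    total = sum(hop_fn(a, b) for a in nodes for b in nodes if a != b)
    return total / (len(nodes) * (len(nodes) - 1))


def mesh_hops(a, b):
    # 2D mesh with minimal routing: hops equal the Manhattan distance.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def flattened_butterfly_hops(a, b):
    # Assumed model of a 2D flattened butterfly: routers are fully connected
    # within each row and each column, so a packet needs at most one hop per
    # dimension in which source and destination coordinates differ.
    return int(a[0] != b[0]) + int(a[1] != b[1])


if __name__ == "__main__":
    for k in (4, 8):  # illustrative grid sizes, not taken from the article
        mesh = average_hops(k, mesh_hops)
        fbfly = average_hops(k, flattened_butterfly_hops)
        print(f"{k}x{k}: mesh avg hops = {mesh:.2f}, "
              f"flattened butterfly avg hops = {fbfly:.2f}")
```

Under these assumptions the mesh's average hop count grows with k, while the flattened butterfly's stays below two. Hop count is only a proxy: the flattened butterfly pays for its shorter paths with higher-radix routers and longer links, which a real evaluation has to account for.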
