MRR: Enabling fully adaptive multicast routing for CMP interconnection networks

On-network hardware support for multi-destination traffic is a desirable feature in most multiprocessor machines. Hardware multicast enables much more effective bandwidth utilization, because a multi-destination packet does not need to use the same resources repeatedly, as happens when multicast traffic must be decomposed into unicast packets. Chip multiprocessors (CMPs) are no exception to this interest, yet few suitable proposals exist to date. This situation stems from the scarcity of available on-chip resources combined with the common belief that multicast support requires a substantial amount of extra resources. In this work, we propose a new approach for on-chip networks that manages multi-destination traffic in hardware efficiently and with negligible added complexity. We introduce the Multicast Rotary Router (MRR), a router able to: (1) provide on-network multicast support at almost zero cost over the Rotary Router, (2) use a fully adaptive tree to distribute multicast traffic, and (3) perform on-network congestion control, extending the range of network utilization. Performance results obtained with a state-of-the-art full-system simulation framework show that MRR improves the average full-system performance of a CMP by 25% over a unicast Rotary Router in the interconnection network, and by 20% over an input-buffered router with multicast support.
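The bandwidth argument can be illustrated with a small sketch that counts link traversals on a 2D mesh. It is not the MRR mechanism: it assumes deterministic XY (dimension-order) routing rather than the fully adaptive tree proposed here, and the function names and coordinates are hypothetical, chosen only for illustration.

```python
def xy_path(src, dst):
    """Directed links on the XY (dimension-order) route from src to dst."""
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:  # travel along X first
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:  # then along Y
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def unicast_link_uses(src, dests):
    """Multicast decomposed into unicasts: every copy pays for its whole path."""
    return sum(len(xy_path(src, d)) for d in dests)


def multicast_link_uses(src, dests):
    """Tree multicast (assumed XY-based tree): each shared link is used once."""
    tree = set()
    for d in dests:
        tree.update(xy_path(src, d))
    return len(tree)


if __name__ == "__main__":
    src = (0, 0)
    dests = [(3, 0), (3, 1), (3, 2)]
    print(unicast_link_uses(src, dests))    # 12 link traversals
    print(multicast_link_uses(src, dests))  # 5 link traversals
```

In this toy case the three unicast copies all traverse the same first three links, whereas the tree replicates the packet only where the paths diverge.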
