Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support

Current state-of-the-art on-chip networks provide efficiency, high throughput, and low latency for one-to-one (unicast) traffic. The presence of one-to-many (multicast) or one-to-all (broadcast) traffic can significantly degrade the performance of these designs, since they rely on multiple unicasts to provide one-to-many communication. This produces a burst of packets from a single source and is an inefficient way of performing multicast and broadcast communication. The inefficiency is compounded by the proliferation of architectures and coherence protocols that require multicast and broadcast communication. In this paper, we characterize a wide array of on-chip communication scenarios that benefit from hardware multicast support. We propose Virtual Circuit Tree Multicasting (VCTM) and present a detailed multicast router design that improves network performance by up to 90% while reducing network activity (and hence power) by up to 53%. Our VCTM router is flexible enough to improve interconnect performance for a broad spectrum of multicasting scenarios, and it achieves these benefits with straightforward and inexpensive extensions to a state-of-the-art packet-switched router.
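To make the cost of unicast-based multicast concrete, the following minimal sketch counts link traversals on a 2D mesh when one source reaches several destinations, first with one unicast per destination and then with a single packet that is replicated only where paths diverge. It is an illustration under stated assumptions, not the VCTM router itself: the mesh coordinates, the dimension-order (X then Y) routing, and the names `xy_route`, `unicast_link_traversals`, and `tree_link_traversals` are all hypothetical choices made for this example.

```python
# Illustrative sketch only: assumes a 2D mesh with dimension-order (X then Y)
# routing. All function names are hypothetical and not taken from the paper.

def xy_route(src, dst):
    """Return the directed links on an X-then-Y path from src to dst."""
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:  # travel along the X dimension first
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:  # then travel along the Y dimension
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def unicast_link_traversals(src, dests):
    """Multicast as separate unicasts: every path pays for all of its links."""
    return sum(len(xy_route(src, d)) for d in dests)


def tree_link_traversals(src, dests):
    """Tree multicast: a link shared by several paths is traversed once."""
    shared_links = set()
    for d in dests:
        shared_links.update(xy_route(src, d))
    return len(shared_links)


if __name__ == "__main__":
    src = (0, 0)
    dests = [(3, 0), (3, 1), (3, 2), (3, 3)]
    print("unicast link traversals:", unicast_link_traversals(src, dests))
    print("tree link traversals:   ", tree_link_traversals(src, dests))
```

For this particular source and destination set, the separate unicasts traverse 18 links in total while the shared tree traverses 6, because the common prefix of the paths is paid for only once. The exact ratio depends entirely on the assumed topology, routing function, and destination set, so it should not be read as a reproduction of the reported results.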
