Dual partitioning multicasting for high-performance on-chip networks

As the number of cores integrated onto a single chip increases, power dissipation and network latency become increasingly stringent constraints. On-chip networks provide an efficient and scalable interconnection paradigm for chip multiprocessors (CMPs), and one-to-many (multicast) communication is common on such platforms. Without efficient multicast support, traditional unicast on-chip networks handle this communication inefficiently. In this paper, we propose Dual Partitioning Multicasting (DPM) to reduce packet latency and balance network resource utilization. Specifically, the DPM scheme adaptively makes routing decisions based on the network load-balance level as well as the link-sharing patterns characterized by the distribution of the multicast destinations. Extensive experimental results for synthetic traffic as well as real applications show that, compared with the recently proposed RPM scheme, DPM significantly reduces average packet latency and lowers network power consumption. More importantly, DPM is highly scalable for future on-chip networks with heavy traffic loads and a variety of traffic patterns.

Highlights: Multicast traffic threatens the scalability of on-chip unicasting mechanisms. We propose Dual Partitioning Multicasting to balance network link usage. DPM simultaneously yields high performance for unicast traffic. DPM effectively improves average packet latency and network power dissipation. DPM yields better scalability compared with previous work.
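To make the cost of multicasting over a unicast-only network concrete, the following minimal sketch counts link traversals in a 2D mesh under dimension-order (XY) routing. It is not the DPM algorithm, whose partitioning rules are not given in this abstract; it only compares repeated unicasts with an idealized tree in which links shared by several destinations are traversed once. The function names and the example coordinates are hypothetical.

```python
from typing import List, Set, Tuple

Node = Tuple[int, int]   # (x, y) coordinate of a router in the mesh
Link = Tuple[Node, Node] # directed link between adjacent routers


def xy_path(src: Node, dst: Node) -> List[Link]:
    """Links visited by XY routing: travel along X first, then along Y."""
    links: List[Link] = []
    x, y = src
    while x != dst[0]:
        next_x = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (next_x, y)))
        x = next_x
    while y != dst[1]:
        next_y = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, next_y)))
        y = next_y
    return links


def unicast_traversals(src: Node, dsts: List[Node]) -> int:
    """One separate packet per destination: every path is paid in full."""
    return sum(len(xy_path(src, d)) for d in dsts)


def tree_traversals(src: Node, dsts: List[Node]) -> int:
    """Idealized tree multicast: a link shared by several paths counts once."""
    shared: Set[Link] = set()
    for d in dsts:
        shared.update(xy_path(src, d))
    return len(shared)


if __name__ == "__main__":
    source = (0, 0)
    destinations = [(3, 0), (3, 1), (3, 2)]
    print(unicast_traversals(source, destinations))  # 12
    print(tree_traversals(source, destinations))     # 5
```

In this example the three destinations share the same row segment out of the source, so replicating the packet in the network instead of at the source cuts link traversals from 12 to 5. How much is saved depends on how the destinations are distributed, which is the link-sharing pattern the abstract refers to.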