Supporting efficient collective communication in NoCs

Across many architectures and parallel programming paradigms, collective communication plays a key role in performance and correctness. Hardware support is necessary to prevent important collective communication from becoming a system bottleneck. Support for multicast communication in Networks-on-Chip (NoCs) has achieved substantial throughput improvements and power savings. In this paper, we explore support for reduction or many-to-one communication operations. As a case study, we focus on acknowledgement messages (ACK) that must be collected in a directory protocol before a cache line may be upgraded to or installed in the modified state. This paper makes two primary contributions: an efficient framework to support the reduction of ACK packets and a novel Balanced, Adaptive Multicast (BAM) routing algorithm. The proposed message combination framework complements several multicast algorithms. By combining ACK packets during transmission, this framework not only reduces packet latency by 14.1% for low-to-medium network loads, but also improves the network saturation throughput by 9.6% with little overhead. The balanced buffer resource configuration of BAM improves the saturation throughput by an additional 13.8%. For the PARSEC benchmarks, our design offers an average speedup of 12.7% and a maximal speedup of 16.8%.

[1]  Natalie D. Enright Jerger SigNet: Network-on-chip filtering for coarse vector directories , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[2]  Karthikeyan Sankaralingam,et al.  Implementation and Evaluation of a Dynamically Routed Processor Operand Network , 2007, First International Symposium on Networks-on-Chip (NOCS'07).

[3]  Chita R. Das,et al.  A case for heterogeneous on-chip interconnects for CMPs , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[4]  Jeffrey T. Draper,et al.  Multicast routing with dynamic packet fragmentation , 2009, GLSVLSI '09.

[5]  David F. Heidel,et al.  An Overview of the BlueGene/L Supercomputer , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[6]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[7]  Li-Shiuan Peh,et al.  Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[8]  W. Daniel Hillis,et al.  The Network Architecture of the Connection Machine CM-5 , 1996, J. Parallel Distributed Comput..

[9]  Natalie D. Enright Jerger,et al.  DBAR: An efficient routing algorithm to support multiple concurrent applications in networks-on-chip , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[10]  José Duato,et al.  Efficient unicast and multicast support for CMPs , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[11]  Anoop Gupta,et al.  Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes , 1990, ICPP.

[12]  William J. Dally,et al.  Principles and Practices of Interconnection Networks , 2004 .

[13]  Ieee Xiang,et al.  The TianHe-1A Supercomputer: Its Hardware and Software , 2011 .

[14]  Valentin Puente,et al.  MRR: Enabling fully adaptive multicast routing for CMP interconnection networks , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[15]  Natalie D. Enright Jerger,et al.  Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support , 2008, 2008 International Symposium on Computer Architecture.

[16]  Natalie D. Enright Jerger,et al.  On-Chip Networks , 2009, On-Chip Networks.

[17]  Natalie D. Enright Jerger,et al.  Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[18]  Rahul Boyapati,et al.  Efficient lookahead routing and header compression for multicasting in networks-on-chip , 2010, 2010 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS).

[19]  Axel Jantsch,et al.  Connection-oriented multicasting in wormhole-switched networks on chip , 2006, IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures (ISVLSI'06).

[20]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[21]  A. Kumary,et al.  A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS , 2007 .

[22]  Yingtao Jiang,et al.  On an efficient NoC multicasting scheme in support of multiple applications running on irregular sub-networks , 2011, Microprocess. Microsystems.

[23]  Ran Ginosar,et al.  The Power of Priority: NoC Based Distributed Cache Coherency , 2007, First International Symposium on Networks-on-Chip (NOCS'07).

[24]  Chita R. Das,et al.  A low latency router supporting adaptivity for on-chip interconnects , 2005, Proceedings. 42nd Design Automation Conference, 2005..

[25]  Hyungjun Kim,et al.  Recursive partitioning multicast: A bandwidth-efficient routing for Networks-on-Chip , 2009, 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip.

[26]  William J. Dally,et al.  A delay model and speculative architecture for pipelined routers , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[27]  José Duato,et al.  A New Theory of Deadlock-Free Adaptive Routing in Wormhole Networks , 1993, IEEE Trans. Parallel Distributed Syst..

[28]  George Michelogiannakis,et al.  Evaluating Bufferless Flow Control for On-chip Networks , 2010, 2010 Fourth ACM/IEEE International Symposium on Networks-on-Chip.

[29]  Manfred Glesner,et al.  New Theory for Deadlock-Free Multicast Routing in Wormhole-Switched Virtual-Channelless Networks-on-Chip , 2011, IEEE Transactions on Parallel and Distributed Systems.

[30]  Dhabaleswar K. Panda Fast barrier synchronization in wormhole k-ary n-cube networks with multidestination worms , 1995, Future Gener. Comput. Syst..

[31]  Stephen W. Keckler,et al.  Regional congestion awareness for load balance in networks-on-chip , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[32]  Hong Xu,et al.  Efficient implementation of barrier synchronization in wormhole-routed hypercube multicomputers , 1992, [1992] Proceedings of the 12th International Conference on Distributed Computing Systems.

[33]  Norman P. Jouppi,et al.  CACTI 6.0: A Tool to Model Large Caches , 2009 .

[34]  Anoop Gupta,et al.  The directory-based cache coherence protocol for the DASH multiprocessor , 1990, ISCA '90.

[35]  Mark Horowitz,et al.  Energy dissipation in general purpose microprocessors , 1996, IEEE J. Solid State Circuits.

[36]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[37]  Sudhakar Yalamanchili,et al.  Interconnection Networks: An Engineering Approach , 2002 .

[38]  Milos Prvulovic,et al.  TLSync: Support for multiple fast barriers using on-chip transmission lines , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[39]  Niraj K. Jha,et al.  A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS , 2007, ICCD.

[40]  Lionel M. Ni,et al.  Multi-address Encoding for Multicast , 1994, PCRCW.

[41]  Ralph Grishman,et al.  The NYU ultracomputer—designing a MIMD, shared-memory parallel machine , 2018, ISCA '98.