An Efficient Multicast Router using Shared-Buffer with Packet Merging for Dataflow Architecture

Dataflow architecture has native advantages in achieving high instruction parallelism and power efficiency for today's emerging applications such as high-performance computing and deep neural networks. In dataflow computing, the execution of instructions is driven by data, so the data transfer efficiency of the network on chip (NoC) is a key factor affecting performance. In the NoC, the latest router uses a multicast routing scheme and an output buffer structure to improve network transfer efficiency. However, the effective utilization of the router's buffer is low because of the multicast transfer characteristics and the unbalanced network load. This observation motivates us to design MRSB, a router architecture that improves buffer utilization by allowing input ports to share data and buffer resources. Because a multicast packet is continuously split during transfer, its effective bandwidth utilization decreases, and small packets waste more buffer cell space. We therefore extend MRSB with packet merging, which is applied according to the bandwidth occupied by different types of packets. On our experimental workloads, MRSB achieves 221.48% higher effective buffer utilization and 32.98% lower latency than a state-of-the-art router, with 31.39% smaller area and 29.14% lower power. The performance of the dataflow accelerator using MRSB improves by 25.61%, and the average energy of the experimental workloads is reduced by 24.27%.
