Ruche Networks: Wire-Maximal, No-Fuss NoCs

Network-on-chip (NoC) design has been an active area of academic research for two decades, but many proposed ideas have not been adopted in real chips because they have complex behavior or create significant risk in chip implementation. For this reason, many existing chips simply employ fast, replicated, vanilla dimension-ordered mesh NoCs. However, these networks do not come close to utilizing the full available VLSI wiring capability, and they propagate packets at speeds significantly below the raw speed of wires. The ideal network would require no custom circuits and would decompose easily into a hierarchical CAD flow consisting of a top-level design instantiating a mesh of identical hardened tiles with short-wire neighbor connections. At the same time, this ideal network would scale easily to utilize the majority of the available chip wiring resources, and would offer a mechanism for scaling this wire usage up or down based on the bandwidth required. Packets would spend a significant fraction of their time in wire delay rather than router delay. Finally, the NoC would be simple to understand. This paper proposes Ruche Networks, which fulfill these requirements. They are based on simple 2-D mesh networks but amplify the NoC bandwidth and reduce the NoC diameter of tiled architectures by adding long-range physical channels from each tile to other tiles on the same row or column. The more distant the connections, the greater the bandwidth of the network and the lower its diameter. The distance is typically increased until all of the physical VLSI wiring bandwidth has been absorbed. We explain the rationale for this “ruching” and provide a simple methodology for designing and implementing these networks using a standard-cell VLSI CAD flow. In this paper, we show the steps involved in ruching the HammerBlade Manycore’s mesh networks; these steps apply easily to other designs.
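A rough back-of-envelope sketch (not from the paper itself) of why longer ruche channels lower network diameter, assuming greedy dimension-ordered routing that takes the long-range links first and finishes the remainder on unit-length mesh links; the function names and routing policy here are illustrative assumptions:

```python
def hops(distance, ruche_factor):
    """Hops to cover `distance` tiles along one row, using skip links
    of length `ruche_factor` first, then unit-length mesh links.
    (Assumed greedy routing policy, for illustration only.)"""
    if ruche_factor <= 1:
        return distance  # plain mesh: one hop per tile traversed
    return distance // ruche_factor + distance % ruche_factor

def row_diameter(n_tiles, ruche_factor):
    """Worst-case hop count between the two ends of a row of tiles."""
    return hops(n_tiles - 1, ruche_factor)

if __name__ == "__main__":
    # Longer ruche channels shrink the worst-case hop count of a
    # 16-tile row, at the cost of more wiring per channel.
    for r in (1, 2, 3):
        print(f"ruche factor {r}: row diameter {row_diameter(16, r)}")
```

For a 16-tile row this gives diameters of 15, 8, and 5 hops for ruche factors 1, 2, and 3, matching the abstract's claim that more distant connections lower the diameter (while each added channel also absorbs more of the physical wiring bandwidth).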
