Crossbar NoCs Are Scalable Beyond 100 Nodes

We describe the design and layout of a radix-128 crossbar in 90 nm CMOS. The data path is 32 bits wide and runs at 750 MHz with a three-stage pipeline, and it fits in a silicon area as small as 6.6 mm² when filled to the 90% level. The control path occupies 7 mm² next to the data path, filled to the 35% level, and reconfigures the data path once every three clock cycles. Next, we arrange 128 "user tiles" of 1 mm² each around the crossbar, forming a 150 mm² die, and we connect all tiles to the crossbar via global links that run on top of the tiles. Including the overhead of repeaters and flip-flops on the global links, the area cost of the crossbar is 11% of the die. Thus, we demonstrate that crossbar networks-on-chip (NoCs) are small enough for radices above 100 ports, far exceeding the few tens of ports that were believed to be the practical limit until now. We also attempt a first-order comparison between our crossbar and a model of a popular mesh NoC, and we find that our crossbar NoC improves performance when traffic is global and stressed, at the cost of worse performance when traffic is local and benign. Finally, we present an experimental cost analysis showing that crossbar area practically grows as O(N²W), because all wiring of the crossbar fits over its standard cells, while crossbar delay grows as O(N√W), because wire length increases with the perimeter of the crossbar.
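The headline figures and the two scaling laws above can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, and the assumptions are that N is the radix, that W is the data-path width in bits, and that the 6.6 mm² data path can serve as the reference point for the O(N²W) area trend. Extrapolating to other radices is not a result from the paper.

```python
import math

# Figures quoted in the abstract.
RADIX = 128            # number of crossbar ports (N)
WIDTH_BITS = 32        # data-path width (W)
CLOCK_MHZ = 750        # data-path clock frequency
DATAPATH_AREA_MM2 = 6.6


def port_bandwidth_gbps(width_bits, clock_mhz):
    """Raw bandwidth of one port: width times clock rate, in Gb/s."""
    return width_bits * clock_mhz / 1000.0


def aggregate_bandwidth_tbps(radix, width_bits, clock_mhz):
    """Raw bandwidth summed over all ports, in Tb/s."""
    return radix * port_bandwidth_gbps(width_bits, clock_mhz) / 1000.0


def scaled_area(area_ref, n_ref, w_ref, n, w):
    """Area extrapolated from a reference design, assuming area ~ N^2 * W."""
    return area_ref * (n / n_ref) ** 2 * (w / w_ref)


def delay_factor(n_ref, w_ref, n, w):
    """Relative delay versus a reference design, assuming delay ~ N * sqrt(W)."""
    return (n / n_ref) * math.sqrt(w / w_ref)


print(port_bandwidth_gbps(WIDTH_BITS, CLOCK_MHZ))              # 24.0 Gb/s per port
print(aggregate_bandwidth_tbps(RADIX, WIDTH_BITS, CLOCK_MHZ))  # 3.072 Tb/s in total

# Hypothetical what-if: doubling the radix at the same width.
area_256 = scaled_area(DATAPATH_AREA_MM2, RADIX, WIDTH_BITS, 256, WIDTH_BITS)
print(f"{area_256:.2f} mm2")                                   # 26.40 mm2
print(delay_factor(RADIX, WIDTH_BITS, 256, WIDTH_BITS))        # 2.0
```

Under these assumptions, doubling the radix at a fixed width quadruples the data-path area and doubles the delay, which is the trade-off the cost analysis describes.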
