Paper Information - A Cost-Efficient Router Architecture for HPC Inter-Connection Networks: Design and Implementation

High-radix routers with lower latency and higher bandwidth play an increasingly important role in constructing large-scale interconnection networks such as those used in supercomputers and datacenters. The tile-based crossbar approach partitions a single large crossbar into many small tiles, which considerably reduces arbitration complexity while providing higher throughput than a conventional switch implementation. However, it does not scale well because of power consumption, placement, and routing problems. Inspired by non-saturated throughput theory, this paper proposes a scalable router microarchitecture termed the Multiport Binding Tile-based Router (MBTR). By aggregating multiple physical ports into a single tile, a high-radix router can be flexibly organized into different tile arrays, so the number of tiles and the hardware overhead can be considerably reduced. For a radix-64 router, MBTR achieves up to $50 \sim 75\%$ reduction in memory consumption and wire area compared with a hierarchical switch. We theoretically deduce the sufficient and necessary conditions for an asymmetrical crossbar to achieve unsaturated relative 100 percent throughput. Based on this observation, we analyze the MBTR throughput and derive the condition that the MBTR design parameters must satisfy to yield 100 percent throughput. We further discuss how to trade off MBTR parameters under performance, power, and area constraints. Simulation results demonstrate that MBTR is indistinguishable from the YARC router in terms of throughput and delay, and can even outperform it by reducing potential contention for output ports. We have fabricated a 36-port MBTR chip at 28 nm that provides 100 Gb/s bidirectional bandwidth per port with a fall-through latency of just 30 ns. Internally it runs at 9.6 Tb/s, thus offering a speedup of $1.34\times$.
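
The two quantitative ideas in the abstract, fewer tiles through port binding and the internal speedup of the fabricated chip, can be illustrated with a short calculation. This is a minimal sketch, not the authors' model. The function names and the parameter `ports_per_tile` are hypothetical. It assumes the tile count is simply the radix divided by the number of ports bound per tile, and it reads "100 Gb/s bidirectional" as 100 Gb/s in each direction, an interpretation the abstract does not confirm.

```python
from fractions import Fraction


def tile_count(radix: int, ports_per_tile: int) -> int:
    """Tiles needed when each tile binds `ports_per_tile` physical ports.

    Assumption: every port belongs to exactly one tile, so the count is
    radix / ports_per_tile. The abstract does not give this formula.
    """
    if ports_per_tile <= 0 or radix % ports_per_tile != 0:
        raise ValueError("radix must be a positive multiple of ports_per_tile")
    return radix // ports_per_tile


def internal_speedup(ports: int, gbps_per_direction: int, internal_gbps: int) -> Fraction:
    """Ratio of internal bandwidth to aggregate external bandwidth.

    Assumption: external bandwidth counts both directions of every port.
    """
    external_gbps = ports * gbps_per_direction * 2
    return Fraction(internal_gbps, external_gbps)


if __name__ == "__main__":
    # Radix-64 router with 1, 2, and 4 ports bound per tile.
    for m in (1, 2, 4):
        print(f"{m} port(s) per tile: {tile_count(64, m)} tiles")

    # 36 ports, 100 Gb/s per direction, 9.6 Tb/s (9600 Gb/s) internally.
    s = internal_speedup(36, 100, 9600)
    print(f"speedup = {s} (about {float(s):.3f})")
```

Under these assumptions the ratio is 4/3, or about 1.333, which is close to the $1.34\times$ quoted in the abstract. The small difference suggests that the authors' exact accounting differs slightly from this reading.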

Kai Lu | Jinshu Su | Yi Dai | Liquan Xiao
