High-Radix Scalable Modular Crossbar Switches

As process technologies have scaled, the increasing number of processor cores and memories on a single die has driven the need for more complex on-chip interconnection networks. Crossbar switches are primary building blocks in such networks-on-chip, as they can be used as fast single-stage networks or as the core of the router switch in multi-stage networks. While crossbars offer non-blocking, single-hop, all-to-all communication, they tend to scale poorly with the number of nodes due to the latency and energy of the long wires and high-radix multiplexer structures needed. In this work, we investigate how to improve crossbar performance, energy efficiency, and scalability.

To better understand the design space and scaling limitations, we have developed an on-chip switch modeling tool calibrated using circuit-level simulations. The tool enables a design space exploration showing how area, power, and performance vary across radix, data width, wire parameters, and circuit implementation. In addition to conventional design options, we examined capacitively coupled low-swing signaling to improve the energy consumption of the I/O wires. This exploration shows that the main bottlenecks are the long I/O wires and that the key to improving performance and efficiency is to minimize the area. Using these insights, we present modular crossbar switches that can perform better at high radices than monolithic designs. The modular sub-blocks are arranged in a controlled flow-through, pipelined scheme to eliminate global connections and maintain linear performance scaling and high throughput. Modularity also enables energy savings via deactivation of unused I/O wires.

To evaluate our design, we implemented a prototype radix-64 modular crossbar switch testchip in a 40 nm bulk CMOS process. The testchip operates at 2.38 GHz at a 1 V nominal supply voltage and consumes 1.2 W of power. It offers 2.2X better throughput and 2.4X better energy efficiency than published state-of-the-art designs.

We further evaluated modular crossbar networks with the proposed crossbar switches using BookSim2, a network-on-chip evaluation tool. The proposed design achieves more than 90% saturation throughput with an internal speedup of 1.5, supports high data line rates, and offers lower average network latency compared to conventional crossbars. Evaluation results show that modular crossbars are scalable to high radices while still offering high performance, energy efficiency, and one-hop simplicity.
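
To illustrate why area and I/O wire length dominate at high radix, the following is a minimal first-order sketch and not the calibrated modeling tool described in this work. It assumes a wire-limited, square matrix crossbar in which every port contributes one wire track per data bit, so each side of the switch spans roughly radix × data width × wire pitch. The function names, the 64-bit data width, and the 0.2 µm wire pitch are hypothetical values chosen only for illustration.

```python
# First-order geometric model of a monolithic matrix crossbar.
# Assumption (not from the source): the layout is wire-limited, so each
# side of the switch is radix * data_width * wire_pitch long.

def crossbar_side_um(radix, data_width_bits, wire_pitch_um):
    """Approximate length of one side, i.e. the span of an I/O wire, in um."""
    return radix * data_width_bits * wire_pitch_um


def crossbar_area_mm2(radix, data_width_bits, wire_pitch_um):
    """Approximate area of the square crossbar in mm^2."""
    side_mm = crossbar_side_um(radix, data_width_bits, wire_pitch_um) / 1000.0
    return side_mm * side_mm


if __name__ == "__main__":
    DATA_WIDTH_BITS = 64   # hypothetical data width
    WIRE_PITCH_UM = 0.2    # hypothetical wire pitch

    for radix in (16, 32, 64, 128):
        side = crossbar_side_um(radix, DATA_WIDTH_BITS, WIRE_PITCH_UM)
        area = crossbar_area_mm2(radix, DATA_WIDTH_BITS, WIRE_PITCH_UM)
        print(f"radix {radix:>3}: I/O wire span ~{side:8.1f} um, "
              f"area ~{area:7.3f} mm^2")
```

Under these assumptions, doubling the radix doubles the I/O wire span and quadruples the area, which is consistent with the observation that long I/O wires are the main bottleneck and that reducing area is the key lever.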
