Breaking the on-chip latency barrier using SMART

As the number of on-chip cores increases, scalable on-chip topologies such as meshes inevitably add multiple hops to each network traversal. The best we can do right now is to design 1-cycle routers, so that the low-load network latency between a source and a destination equals the number of routers plus links between them (i.e., hops × 2). Designers of operating systems, compilers, and cache coherence protocols often try to limit communication to within a few hops, since on-chip latency is critical to their scalability. In this work, we propose an on-chip network called SMART (Single-cycle Multi-hop Asynchronous Repeated Traversal) that aims to present a single-cycle data path all the way from the source to the destination. We do not add any fast physical express links to the data path; instead, we drive the shared crossbars and links asynchronously up to multiple hops within a single cycle. We design a router and link microarchitecture to achieve such a traversal, and a flow-control technique to arbitrate and set up multi-hop paths within a cycle. A placed-and-routed design at 45 nm achieves 11 hops within a 1 GHz cycle for paths without turns (9 for paths with turns). We observe a 5-8X reduction in low-load latencies across synthetic traffic patterns on an 8×8 CMP, compared to a baseline 1-cycle router. Full-system simulations with SPLASH-2 and PARSEC benchmarks demonstrate 27/52% and 20/59% reduction in runtime and EDP for Private/Shared L2 designs.
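
The baseline latency arithmetic above (hops × 2 for a 1-cycle router) can be made concrete with a small sketch. It is illustrative only and is not the paper's simulator: the function names and the `hpc_max` parameter are hypothetical, and `smart_segments` is an idealized count of single-cycle multi-hop traversals that ignores path setup, arbitration, and contention, so it should not be read as a SMART latency figure.

```python
import math
from itertools import product


def mesh_hops(src, dst):
    """Manhattan hop count between two (x, y) nodes of a 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def baseline_latency(hops):
    """Low-load latency in cycles for 1-cycle routers:
    one cycle per router plus one per link, i.e. hops * 2."""
    return 2 * hops


def smart_segments(hops, hpc_max=11):
    """Idealized number of single-cycle multi-hop traversals needed to
    cover `hops`, assuming at most `hpc_max` hops per cycle.
    ASSUMPTION: ignores setup, arbitration, and contention; the default
    of 11 is the abstract's figure for paths without turns."""
    return math.ceil(hops / hpc_max)


def average_hops(k=8):
    """Average hop count over all distinct source/destination pairs
    of a k x k mesh (uniform random traffic, no self-traffic)."""
    nodes = list(product(range(k), repeat=2))
    pairs = [(s, d) for s in nodes for d in nodes if s != d]
    return sum(mesh_hops(s, d) for s, d in pairs) / len(pairs)


if __name__ == "__main__":
    corner_hops = mesh_hops((0, 0), (7, 7))
    print("corner-to-corner hops:", corner_hops)
    print("baseline cycles:", baseline_latency(corner_hops))
    print("idealized multi-hop traversals:", smart_segments(corner_hops))
    print("average hops on 8x8 mesh:", average_hops(8))
```

The corner-to-corner case on an 8×8 mesh shows why the distinction matters: the baseline pays two cycles for every hop, while the idealized count grows only with the number of multi-hop segments. Passing `hpc_max=9` gives the corresponding count under the abstract's limit for paths with turns.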
