论文信息 - Breaking the on-chip latency barrier using SMART

Breaking the on-chip latency barrier using SMART

As the number of on-chip cores increases, scalable on-chip topologies such as meshes inevitably add multiple hops in each network traversal. The best we can do right now is to design 1-cycle routers, such that the low-load network latency between a source and destination is equal to the number of routers + links (i.e. hops×2) between them. OS/compiler and cache coherence protocols designers often try to limit communication to within a few hops, since on-chip latency is critical for their scalability. In this work, we propose an on-chip network called SMART (Single-cycle Multi-hop Asynchronous Repeated Traversal) that aims to present a single-cycle data-path all the way from the source to the destination. We do not add any additional fast physical express links in the data-path; instead we drive the shared crossbars and links asynchronously up to multiple-hops within a single cycle. We design a router + link microarchitecture to achieve such a traversal, and a flow-control technique to arbitrate and setup multi-hop paths within a cycle. A place-and-routed design at 45nm achieves 11 hops within a 1GHz cycle for paths without turns (9 for paths with turns). We observe 5-8X reduction in low-load latencies across synthetic traffic patterns on an 8×8 CMP, compared to a baseline 1-cycle router. Full-system simulations with SPLASH-2 and PAR-SEC benchmarks demonstrate 27/52% and 20/59% reduction in runtime and EDP for Private/Shared L2 designs.

[1] Sanjay J. Patel,et al. Rigel: A 1,024-Core Single-Chip Accelerator Architecture , 2011, IEEE Micro.

[2] William J. Dally,et al. Principles and Practices of Interconnection Networks , 2004 .

[3] Simon W. Moore,et al. Low-latency virtual-channel routers for on-chip networks , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[4] Milo M. K. Martin,et al. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[5] Anantha Chandrakasan,et al. Approaching the theoretical limits of a mesh NoC with a 16-node chip prototype in 45nm SOI , 2012, DAC Design Automation Conference 2012.

[6] Chen Sun,et al. DSENT - A Tool Connecting Emerging Photonics with Electronics for Opto-Electronic Networks-on-Chip Modeling , 2012, 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip.

[7] Stephen B. Furber,et al. Chain: A Delay-Insensitive Chip Area Interconnect , 2002, IEEE Micro.

[8] Niraj K. Jha,et al. A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS , 2007, ICCD.

[9] William J. Dally,et al. Microarchitecture of a high radix router , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[10] Li-Shiuan Peh,et al. SWIFT: A SWing-reduced interconnect for a Token-based Network-on-Chip in 90nm CMOS , 2010, 2010 IEEE International Conference on Computer Design.

[11] Michael Gschwind,et al. The IBM Blue Gene/Q Compute Chip , 2012, IEEE Micro.

[12] Jan M. Rabaey,et al. Digital Integrated Circuits: A Design Perspective , 1995 .

[13] George Kurian,et al. ATAC: A 1000-core cache-coherent processor with on-chip optical network , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[14] William J. Dally,et al. The BlackWidow High-Radix Clos Network , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[15] Ming Yang,et al. CNoC: High-Radix Clos Network-on-Chip , 2011, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[16] Li-Shiuan Peh,et al. Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[17] Niraj K. Jha,et al. Token flow control , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[18] Onur Mutlu,et al. Kilo-NOC: A heterogeneous network-on-chip architecture for scalability and service guarantees , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[19] Onur Mutlu,et al. Express Cube Topologies for on-Chip Interconnects , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[20] Hideharu Amano,et al. Prediction router: Yet another low latency on-chip router architecture , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[21] Vladimir Stojanovic,et al. Equalized interconnects for on-chip networks: modeling and optimization framework , 2007, ICCAD 2007.

[22] Timothy Mattson,et al. A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[23] William J. Dally,et al. Design tradeoffs for tiled CMP on-chip networks , 2006, ICS '06.

[24] Alexander Sprintson,et al. Asynchronous Bypass Channels: Improving Performance for Multi-synchronous NoCs , 2010, 2010 Fourth ACM/IEEE International Symposium on Networks-on-Chip.

[25] Sriram R. Vangal,et al. A 5-GHz Mesh Interconnect for a Teraflops Processor , 2007, IEEE Micro.

[26] William J. Dally,et al. Flattened Butterfly Topology for On-Chip Networks , 2007, IEEE Comput. Archit. Lett..

[27] Saurabh Dighe,et al. A 48-Core IA-32 Processor in 45 nm CMOS Using On-Die Message-Passing and DVFS for Performance and Power Scaling , 2011, IEEE Journal of Solid-State Circuits.

[28] A. Kumary,et al. A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS , 2007 .

[29] Kai Li,et al. The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[30] M. Erez,et al. Express Virtual Channels with Capacitively Driven Global Links , 2009, IEEE Micro.

[31] Niraj K. Jha,et al. GARNET: A detailed on-chip network model inside a full-system simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[32] David A. Wood,et al. Variability in architectural simulations of multi-threaded workloads , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[33] Jens Sparsø,et al. A router architecture for connection-oriented service guarantees in the MANGO clockless network-on-chip , 2005, Design, Automation and Test in Europe.

[34] Henry Hoffmann,et al. On-Chip Interconnection Architecture of the Tile Processor , 2007, IEEE Micro.