Paper Information - Swizzle-Switch Networks for Many-Core Systems

This work revisits the design of crossbar and high-radix interconnects in light of advances in circuit and layout techniques that improve crossbar scalability, obviating the need for deep multi-stage networks. We employ a new building block, the Swizzle-Switch, an energy- and area-efficient switching element that can readily scale to radix 64 and that has recently been validated via silicon test chips in 45 nm technology. We evaluate the Swizzle-Switch both as the high-radix building block of a Flattened Butterfly and as a single-stage interconnect, the Swizzle-Switch Network. In the process, we address the architectural and layout challenges associated with centralized crossbar systems. Compared to a conventional Mesh, the Flattened Butterfly provides a 15% performance improvement with a 2.5× reduction in the standard deviation of on-chip access times. The Swizzle-Switch Network achieves further gains, providing a 21% improvement in performance, a 3× reduction in on-chip access variability, a 33% reduction in interconnect power, and a 25% reduction in total system energy while increasing chip area by only 7%. Finally, this paper details a 3-D integrated version of the Swizzle-Switch Network, showing up to a 30% gain in performance over the 2-D Swizzle-Switch Network for benchmarks sensitive to interconnect latency. One major concern with 3-D designs is thermal dissipation. We show through detailed thermal analysis that, with the highly energy-efficient Swizzle-Switch Network design, the thermal budget is well within that of passive cooling solutions.
