Architectural integration of rf-interconnect to enhance on-chip communication for many-core chip multiprocessors

As the number of cores integrated onto a die increases, so does the potential for higher throughput, since work can be decomposed into parallel tasks and divided among the computing elements. However, the communication architecture between cores must handle a corresponding increase in bandwidth demand, as the total traffic on the interconnection network scales with the number of active processor cores and cache banks. Poor wire scaling complicates matters further, increasing the latency of communication between distant points on the die. This motivates our investigation of advanced interconnect architectures aimed at reducing on-chip memory-access latency as well as network power consumption. RF-Interconnect (RF-I) is a low-latency, low-power, high-bandwidth signaling technology projected to scale better than both conventional wires and competing transmission-line technologies. In this dissertation, we present the first on-chip interconnection network that uses RF-I technology as an overlay on a conventional wire mesh topology, logically behaving as a set of express channels (or shortcuts) between distant endpoints on the chip. Assuming a 400 mm² die, we have demonstrated that, in exchange for 0.18% area overhead on the active layer, RF-I can provide an average 13% (max 18%) boost in application performance, corresponding to an average 22% (max 24%) reduction in packet latency. We have also observed that the patterns of core-to-core and core-to-cache communication may vary across applications as well as over the course of a single application's execution. Informed by these findings, we have employed dynamically adaptive RF-I shortcuts in our topology, which provide communication bandwidth only where and when it is required. Furthermore, we reduce overall network power consumption by reducing the bandwidth of the baseline mesh links, substituting RF-I for conventional wires to realize an overall power and performance gain.
We demonstrate that this overall strategy can enable 65% NoC power savings and 82.3% NoC area savings, with a simultaneous performance gain of 1%. Finally, we demonstrate that even under ideal conditions, there is no significant performance or power advantage to adapting RF-I shortcuts more frequently than once per application.
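To make the idea of RF-I express channels and per-application shortcut adaptation concrete, here is a minimal sketch, assuming a k x k mesh where each RF-I shortcut is modeled as a single extra bidirectional hop between two routers. Given a hypothetical traffic profile (packet counts per source/destination pair) and a shortcut budget, it greedily picks the candidate shortcuts that most reduce traffic-weighted hop count. All function names here (`mesh_neighbors`, `select_shortcuts`, etc.) are illustrative assumptions, and the greedy hop-count heuristic is not the selection algorithm used in this dissertation. It ignores link contention, RF-I transceiver latency, shortcut directionality, and router port limits.

```python
from collections import deque
from itertools import product


def mesh_neighbors(k):
    """Adjacency sets for a k x k 2D mesh; node id = row * k + col."""
    adj = {n: set() for n in range(k * k)}
    for r, c in product(range(k), repeat=2):
        n = r * k + c
        if c + 1 < k:  # east-west wire link
            adj[n].add(n + 1)
            adj[n + 1].add(n)
        if r + 1 < k:  # north-south wire link
            adj[n].add(n + k)
            adj[n + k].add(n)
    return adj


def hops_from(adj, src):
    """BFS hop counts from src; every link (wire or shortcut) costs one hop."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def weighted_hops(adj, traffic):
    """Sum of packets * shortest-path hops; traffic maps (src, dst) -> packets."""
    cache = {}
    total = 0
    for (s, d), pkts in traffic.items():
        if s not in cache:
            cache[s] = hops_from(adj, s)
        total += pkts * cache[s][d]
    return total


def add_shortcut(adj, a, b):
    """Return a copy of adj with an extra (assumed bidirectional) a<->b shortcut."""
    new = {n: set(nbrs) for n, nbrs in adj.items()}
    new[a].add(b)
    new[b].add(a)
    return new


def select_shortcuts(k, traffic, budget, candidates):
    """Greedily add up to `budget` shortcuts, each time taking the largest gain."""
    adj = mesh_neighbors(k)
    chosen, remaining = [], list(candidates)
    for _ in range(budget):
        base = weighted_hops(adj, traffic)
        best, best_gain = None, 0
        for a, b in remaining:
            gain = base - weighted_hops(add_shortcut(adj, a, b), traffic)
            if gain > best_gain:
                best, best_gain = (a, b), gain
        if best is None:  # no remaining candidate helps this traffic profile
            break
        chosen.append(best)
        remaining.remove(best)
        adj = add_shortcut(adj, *best)
    return chosen, weighted_hops(adj, traffic)


if __name__ == "__main__":
    # Hypothetical profile on a 10 x 10 mesh: heavy corner-to-corner traffic.
    profile = {(0, 99): 100, (9, 90): 50}
    print(select_shortcuts(10, profile, 2, [(0, 99), (9, 90), (0, 9)]))
```

For this hypothetical profile the sketch chooses the shortcuts (0, 99) and (9, 90), cutting the weighted hop count from 2700 to 150. Re-running the selection once per application with that application's traffic profile mirrors, in simplified form, the once-per-application adaptation granularity discussed above.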
