Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks

As the number of cores on a single chip increases with more recent technologies, a packet-switched on-chip interconnection network has become a de facto communication paradigm for chip multiprocessors (CMPs). However, it is inevitable to suffer from high communication latency due to the increasing number of hops. In this paper, we attempt to accelerate network communication by exploiting communication temporal locality with minimal additional hardware cost in the existing state-of-the-art router architecture. We observe that packets frequently traverse through the same path chosen by previous packets due to repeated communication patterns, such as frequent pair-wise communication. Motivated by our observation, we propose a pseudo-circuit scheme. With previous communication patterns, the scheme reserves crossbar connections creating pseudo-circuits, sharable partial circuits within a single router. It reuses the previous arbitration information to bypass switch arbitration if the next flit traverses through the same pseudo-circuit. To accelerate communication performance further, we also propose two aggressive schemes, pseudo-circuit speculation and buffer bypassing. Pseudo-circuit speculation creates more pseudo-circuits using unallocated crossbar connections while buffer bypassing skips buffer writes to eliminate one pipeline stage.

[1]  William J. Dally,et al.  Design tradeoffs for tiled CMP on-chip networks , 2006, ICS '06.

[2]  William J. Dally Virtual-channel flow control , 1990, ISCA '90.

[3]  Sharad Malik,et al.  Power-driven design of router microarchitectures in on-chip networks , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[4]  SankaralingamKarthikeyan,et al.  Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture , 2003 .

[5]  Simon W. Moore,et al.  Low-latency virtual-channel routers for on-chip networks , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[6]  William J. Dally,et al.  Microarchitecture of a high radix router , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[7]  Henry Hoffmann,et al.  On-Chip Interconnection Architecture of the Tile Processor , 2007, IEEE Micro.

[8]  Sharad Malik,et al.  Orion: a power-performance simulator for interconnection networks , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[9]  William J. Dally,et al.  Flattened Butterfly Topology for On-Chip Networks , 2007, IEEE Comput. Archit. Lett..

[10]  William J. Dally,et al.  Deadlock-Free Message Routing in Multiprocessor Interconnection Networks , 1987, IEEE Transactions on Computers.

[11]  Henry Hoffmann,et al.  Evaluation of the Raw microprocessor: an exposed-wire-delay architecture for ILP and streams , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[12]  Onur Mutlu,et al.  Express Cube Topologies for on-Chip Interconnects , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[13]  Jaehyuk Huh,et al.  Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture , 2003, IEEE Micro.

[14]  Sriram R. Vangal,et al.  A 5-GHz Mesh Interconnect for a Teraflops Processor , 2007, IEEE Micro.

[15]  G. Edward Suh,et al.  Static virtual channel allocation in oblivious routing , 2009, 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip.

[16]  Norman P. Jouppi,et al.  CACTI 5.0 , 2007 .

[17]  Niraj K. Jha,et al.  Express virtual channels: towards the ideal interconnection fabric , 2007, ISCA '07.

[18]  Theodore R. Bashkow,et al.  A large scale, homogeneous, fully distributed parallel machine, I , 1977, ISCA '77.

[19]  Akif Ali,et al.  Near-optimal worst-case throughput routing for two-dimensional mesh networks , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[20]  Niraj K. Jha,et al.  A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS , 2007, ICCD.

[21]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[22]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[23]  A. Kumary,et al.  A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS , 2007 .

[24]  Mikko H. Lipasti,et al.  Circuit-Switched Coherence , 2007, IEEE Comput. Archit. Lett..

[25]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.

[26]  William J. Dally,et al.  A delay model and speculative architecture for pipelined routers , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[27]  David H. Bailey,et al.  The Nas Parallel Benchmarks , 1991, Int. J. High Perform. Comput. Appl..

[28]  Niraj K. Jha,et al.  Token flow control , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[29]  William J. Dally,et al.  Principles and Practices of Interconnection Networks , 2004 .

[30]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).