Paper information - Packet chaining: Efficient single-cycle allocation for on-chip networks

This paper introduces packet chaining, a simple and effective method for increasing allocator matching efficiency, and hence network performance, that is particularly suited to networks with short packets and short cycle times. Packet chaining links packets destined for the same output so that the next packet reuses the switch connection of a departing one. This lets an allocator build up an efficient matching over several cycles, as incremental allocation does, but without being limited by packet length. For a 64-node 2D mesh at maximum injection rate with single-flit packets, packet chaining increases network throughput by 15% compared to a highly tuned router using a conventional single-iteration separable iSLIP allocator, and it outperforms significantly more complex allocators. Specifically, it outperforms multiple-iteration iSLIP allocators and wavefront allocators by 10% and 6%, respectively, and achieves throughput comparable to that of an augmenting-paths allocator. Packet chaining achieves this performance with a cycle time comparable to that of a single-iteration separable allocator. It also reduces average network latency by 22.5% compared to a single-iteration iSLIP allocator. Finally, packet chaining increases IPC by up to 46% (16% on average) for application benchmarks, because short packets are critical in a typical cache-coherent chip multiprocessor.
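The connection-reuse idea in the abstract can be illustrated with a toy model. The sketch below is not the authors' implementation: the function name `allocate`, the dictionary representation of requests and held connections, the restriction of chaining to the same input port, and the fixed lowest-index priority (standing in for iSLIP's round-robin arbiters) are all assumptions made for illustration. It shows only the mechanism of retaining a connection across cycles and does not reproduce any of the reported throughput or latency figures.

```python
def allocate(requests, held):
    """Run one allocation cycle for a toy crossbar switch.

    requests: dict mapping input port -> output port wanted by its head packet
    held:     dict mapping input port -> output port connected in the previous cycle
    Returns a dict mapping input port -> output port granted this cycle.
    """
    grants = {}

    # Chaining step (assumed variant: same input, same output).
    # If the input that held a connection last cycle has another packet for
    # the same output, keep the connection instead of re-arbitrating it.
    for inp, out in held.items():
        if requests.get(inp) == out:
            grants[inp] = out

    taken = set(grants.values())

    # Conventional step: remaining inputs compete for the outputs still free.
    # Fixed lowest-index priority is a simplification of a real arbiter.
    for inp in sorted(requests):
        out = requests[inp]
        if inp not in grants and out not in taken:
            grants[inp] = out
            taken.add(out)

    return grants


if __name__ == "__main__":
    requests = {0: 2, 1: 2, 3: 1}
    # Input 1 held output 2 last cycle, so its next packet is chained.
    print(allocate(requests, held={1: 2}))  # {1: 2, 3: 1}
    # With no held connection, output 2 is arbitrated from scratch.
    print(allocate(requests, held={}))      # {0: 2, 3: 1}
```

In the first call, the connection from input 1 to output 2 survives into the next cycle, so only the remaining ports need fresh arbitration. That is the sense in which a matching can be built up over several cycles.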
