A case for bufferless routing in on-chip networks

Buffers in on-chip networks consume significant energy, occupy chip area, and increase design complexity. In this paper, we make a case for a new approach to designing on-chip interconnection networks that eliminates the need for buffers for routing or flow control. We describe new algorithms for routing without using buffers in router input/output ports. We analyze the advantages and disadvantages of bufferless routing and discuss how router latency can be reduced by taking advantage of the fact that input/output buffers do not exist. Our evaluations show that routing without buffers significantly reduces the energy consumption of the on-chip cache/processor-to-cache network, while providing similar performance to that of existing buffered routing algorithms at low network utilization (i.e., on most real applications). We conclude that bufferless routing can be an attractive and energy-efficient design option for on-chip cache/processor-to-cache networks where network utilization is low.

[1]  Pedro López,et al.  Reducing Packet Dropping in a Bufferless NoC , 2008, Euro-Par.

[2]  Axel Jantsch,et al.  Evaluation of on-chip networks using deflection routing , 2006, GLSVLSI '06.

[3]  Allan Porterfield,et al.  The Tera computer system , 1990 .

[4]  Axel Jantsch,et al.  Guaranteed bandwidth using looped containers in temporally disjoint networks within the nostrum network on chip , 2004, Proceedings Design, Automation and Test in Europe Conference and Exhibition.

[5]  William J. Dally,et al.  Flattened Butterfly Topology for On-Chip Networks , 2007, IEEE Comput. Archit. Lett..

[6]  Mike Galles Spider: a high-speed network interconnect , 1997, IEEE Micro.

[7]  William J. Dally,et al.  The torus routing chip , 2005, Distributed Computing.

[8]  William J. Dally,et al.  Research Challenges for On-Chip Interconnection Networks , 2007, IEEE Micro.

[9]  Antonios Symvonis,et al.  A General Method for Deflection Worm Routing on Meshes Based on Packet Routing Algorithms , 1997, IEEE Trans. Parallel Distributed Syst..

[10]  Luis Gravano,et al.  Adaptive Deadlock- and Livelock-Free Routing with All Minimal Paths in Torus Networks , 1994, IEEE Trans. Parallel Distributed Syst..

[11]  William J. Dally,et al.  Principles and Practices of Interconnection Networks , 2004 .

[12]  Simon W. Moore,et al.  Low-latency virtual-channel routers for on-chip networks , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[13]  Mikko H. Lipasti,et al.  Circuit-Switched Coherence , 2007, IEEE Comput. Archit. Lett..

[14]  William Jalby,et al.  XOR-Schemes: A Flexible Data Organization in Parallel Memories , 1985, ICPP.

[15]  Larry L. Biro,et al.  Power considerations in the design of the Alpha 21264 microprocessor , 1998, Proceedings 1998 Design and Automation Conference. 35th DAC. (Cat. No.98CH36175).

[16]  George Michelogiannakis,et al.  Elastic-buffer flow control for on-chip networks , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[17]  B J Smith,et al.  A pipelined, shared resource MIMD computer , 1986 .

[19]  Maurice Herlihy,et al.  Routing without flow control , 2001, SPAA '01.

[20]  Stefano Bregni,et al.  Performance Evaluation of Deflection Routing in Optical IP Packet-Switched Networks , 2004, Cluster Computing.

[21]  Xi Wang,et al.  Burst optical deflection routing protocol for wavelength routing WDM networks , 2000, Other Conferences.

[22]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.

[23]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[24]  W. Daniel Hillis,et al.  The connection machine , 1985 .

[25]  J. Duato,et al.  BPS : A Bufferless Switching Technique for NoCs ∗ , 2008 .

[26]  William J. Dally,et al.  A delay model and speculative architecture for pipelined routers , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[27]  Rajiv Kapoor,et al.  Pinpointing Representative Portions of Large Intel® Itanium® Programs with Dynamic Instrumentation , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[28]  Sangyeun Cho,et al.  Managing Distributed, Shared L2 Caches through OS-Level Page Allocation , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[29]  Milo M. K. Martin,et al.  Timestamp snooping: an approach for extending SMPs , 2000, ASPLOS.

[30]  Henry Hoffmann,et al.  Evaluation of the Raw microprocessor: an exposed-wire-delay architecture for ILP and streams , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[31]  William J. Dally Virtual-channel flow control , 1990, ISCA '90.

[32]  William J. Dally,et al.  GOAL: a load-balanced adaptive routing algorithm for torus networks , 2003, ISCA '03.

[33]  S. Konstantinidou,et al.  Chaos router: architecture and performance , 1991, [1991] Proceedings. The 18th Annual International Symposium on Computer Architecture.

[34]  Coniferous softwood GENERAL TERMS , 2003 .

[35]  P. Baran,et al.  On Distributed Communications Networks , 1964 .

[36]  Stephen W. Keckler,et al.  Regional congestion awareness for load balance in networks-on-chip , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[37]  Sharad Malik,et al.  Orion: a power-performance simulator for interconnection networks , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[38]  Baruch Schieber,et al.  Fast deflection routing for packets and worms , 1993, PODC '93.

[39]  Uriel Feige,et al.  Exact analysis of hot-potato routing , 1992, Proceedings., 33rd Annual Symposium on Foundations of Computer Science.

[40]  Maurice Herlihy,et al.  Hard-Potato routing , 2000, STOC '00.

[41]  Philippe Roussel,et al.  The microarchitecture of the intel pentium 4 processor on 90nm technology , 2004 .

[42]  Niraj K. Jha,et al.  Token flow control , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[43]  Manoj Franklin,et al.  Balancing thoughput and fairness in SMT processors , 2001, 2001 IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS..

[44]  George Michelogiannakis,et al.  Approaching Ideal NoC Latency with Pre-Configured Routes , 2007, First International Symposium on Networks-on-Chip (NOCS'07).

[45]  Stijn Eyerman,et al.  System-Level Performance Metrics for Multiprogram Workloads , 2008, IEEE Micro.

[46]  Onur Mutlu,et al.  Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[47]  Henry Hoffmann,et al.  On-Chip Interconnection Architecture of the Tile Processor , 2007, IEEE Micro.

[48]  Avi Mendelson,et al.  Fairness and Throughput in Switch on Event Multithreading , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[49]  Sriram R. Vangal,et al.  A 5-GHz Mesh Interconnect for a Teraflops Processor , 2007, IEEE Micro.

[50]  Burton J. Smith Architecture And Applications Of The HEP Multiprocessor Computer System , 1982, Optics & Photonics.

[51]  Dean M. Tullsen,et al.  Symbiotic jobscheduling for a simultaneous mutlithreading processor , 2000, SIGP.

[52]  S. Lennart Johnsson,et al.  ROMM routing on mesh and torus networks , 1995, SPAA '95.

[53]  Sanjay Bhansali,et al.  Framework for instruction-level tracing and analysis of program executions , 2006, VEE '06.

[54]  Doug Burger,et al.  Implementation and Evaluation of On-Chip Network Architectures , 2006, 2006 International Conference on Computer Design.