Modern routers and switch fabrics can have hundreds of input and output ports running at up to 10 Gb/s, and 40 Gb/s systems are starting to appear. At these rates, the buffering and queuing subsystem becomes a significant performance bottleneck. In high-performance routers with more than a few queues, packet buffering is typically implemented using DRAM for data storage, off-chip SRAM for the linked-list nodes and packet lengths, and on-chip SRAM for the queue headers. This paper focuses on the performance bottlenecks associated with the use of off-chip SRAM. We show how the combination of implicit buffer pointers and multi-buffer list nodes can dramatically reduce the impact of the buffering and queuing subsystem on queuing performance. We also show how combining these techniques with coarse-grained scheduling can improve the performance of fair queuing algorithms while also reducing the amount of off-chip memory and bandwidth needed. Assuming an aggregation degree of 16, these techniques can reduce the amount of SRAM needed to hold the list nodes by a factor of 10, at the cost of wasting about 10% of the DRAM space.
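The two techniques named above can be illustrated with a minimal sketch. It is not the paper's implementation: the buffer size, the class and function names, and the free-list handling are all assumptions made for illustration. Only the aggregation degree of 16 comes from the abstract. The idea shown is that a list node's index determines the DRAM addresses of its buffers (an implicit pointer, so no buffer address is stored), and that one list node covers a group of 16 buffers, so only one link per group is kept in the array that models off-chip SRAM.

```python
BUFFER_BYTES = 64   # assumed DRAM buffer size, not taken from the source
AGGREGATION = 16    # buffers per list node (aggregation degree from the abstract)
NIL = -1


def buffer_address(node: int, slot: int) -> int:
    """Implicit pointer: the DRAM address is computed from the node index."""
    return (node * AGGREGATION + slot) * BUFFER_BYTES


class MultiBufferQueue:
    """Hypothetical FIFO of DRAM buffers built from multi-buffer list nodes."""

    def __init__(self, num_nodes: int):
        # next_node models the off-chip SRAM: one link per node, not per buffer.
        self.next_node = list(range(1, num_nodes)) + [NIL]
        self.free_head = 0
        self.head = NIL
        self.tail = NIL
        self.head_slot = 0  # next slot to read in the head node
        self.tail_slot = 0  # next slot to write in the tail node

    def _alloc(self) -> int:
        if self.free_head == NIL:
            raise MemoryError("no free list nodes")
        node = self.free_head
        self.free_head = self.next_node[node]
        self.next_node[node] = NIL
        return node

    def _free(self, node: int) -> None:
        self.next_node[node] = self.free_head
        self.free_head = node

    def enqueue(self) -> int:
        """Reserve the next buffer and return its DRAM address."""
        if self.tail == NIL or self.tail_slot == AGGREGATION:
            node = self._alloc()
            if self.tail == NIL:
                self.head = node
                self.head_slot = 0
            else:
                self.next_node[self.tail] = node
            self.tail = node
            self.tail_slot = 0
        addr = buffer_address(self.tail, self.tail_slot)
        self.tail_slot += 1
        return addr

    def dequeue(self) -> int:
        """Release the oldest buffer and return its DRAM address."""
        if self.head == NIL:
            raise IndexError("queue is empty")
        addr = buffer_address(self.head, self.head_slot)
        self.head_slot += 1
        if self.head == self.tail and self.head_slot == self.tail_slot:
            # Last buffer consumed: return the node and mark the queue empty.
            self._free(self.head)
            self.head = self.tail = NIL
            self.head_slot = self.tail_slot = 0
        elif self.head_slot == AGGREGATION:
            # Head node exhausted: follow the single stored link.
            nxt = self.next_node[self.head]
            self._free(self.head)
            self.head = nxt
            self.head_slot = 0
        return addr
```

In this sketch, queuing 20 buffers touches only two list nodes, where a one-node-per-buffer list would need 20 links. A partly filled tail node still reserves all 16 of its buffers, which is one plausible source of the DRAM wastage the abstract mentions.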