Request-grant scheduling for congestion elimination in multistage networks

This thesis considers buffered multistage interconnection networks (fabrics) and investigates methods to reduce their buffer size requirements. Our contribution is a novel flow and congestion control scheme that achieves performance close to that of per-flow queueing while requiring much less buffer space than per-flow queues would need. The new scheme uses a request-grant pre-approval phase, as many contemporary bufferless networks do, but its operation is much simpler and its performance is markedly better. Traditionally, the role of requests in bufferless networks is to reserve an available time slot on each link along a packet's route, where these time slots are contiguous in time along the path, so as to guarantee non-conflicting packet transmission. These requirements impose a very heavy toll on the scheduling unit of such bufferless fabrics. By contrast, our requests do not reserve links for a specific time duration; they only reserve space in the buffers at the entry points of those links. Effectively, the scheduling decisions that concern different links are decoupled from one another, leading to a much simpler admission process. The proposed scheduling subsystem comprises independent single-resource schedulers that operate in a pipeline, asynchronously to each other. In this thesis we show that reserving buffers in front of critical network links (links that are unable to carry the potential aggregate demand) eliminates congestion, in the sense that traffic flows seamlessly through the network: it is neither dropped nor excessively blocked waiting for downstream buffers to become available. First, we apply request-grant scheduling to a single-stage switch with small, shared output queues, which serves as a model for the more challenging multistage case.
We demonstrate that, in principle, a very small number of fabric buffers suffices to reach high performance levels: with 12 cells of buffer space per output, performance is better than in buffered crossbars, which consume N cells of buffer space per output, where N is the number of ports. In this single-stage setting, we study the impact of input contention on scheduler performance, and the related synchronization phenomena. In the course of this work, we also introduce a novel scheduling scheme for buffered crossbar switches that makes buffer size independent of the round-trip time between the linecards and the switch. We then proceed to the multistage case. Our main motivation and primary benchmark is an example next-generation fabric challenge: a 1024 × 1024, 3-stage, non-blocking Clos/Benes fabric, running with no internal speedup, made of 96 single-chip 32 × 32 buffered crossbar switching elements (3 stages of 32 switch chips each). To eliminate congestion in the fabric, we carefully apply our request-grant scheduling protocol. We demonstrate that it is feasible to implement all schedulers centrally, in a single chip. Besides eliminating congestion, our scheduler can guarantee 100% in-order delivery, using very small reorder buffers that can easily fit in on-chip memory. Simulation results indicate very good delay performance, and throughput that exceeds 95% under unbalanced traffic. Most prominent is the result that, under hot-spot traffic, with almost all output ports being congested, the non-congested outputs experience negligible delay degradation. The proposed system can operate directly on variable-size packets, eliminating the padding overhead and the associated internal speedup. We also discuss a possible distributed version of the scheduling subsystem.
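The sizing figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names `clos_dimensions` and `buffer_cells` are hypothetical, and the choice N = 32 for the single-stage comparison is an assumption made here to match the radix of the switching elements, not a value stated in the text.

```python
def clos_dimensions(radix, stages=3):
    """Ports and chip count of a fabric built from radix x radix elements,
    with `radix` chips per stage (hypothetical helper)."""
    ports = radix * radix      # radix first-stage chips, radix external ports each
    chips = stages * radix     # `radix` chips in each of the stages
    return ports, chips


def buffer_cells(num_outputs, cells_per_output):
    """Total buffer space, in cells, summed over all outputs."""
    return num_outputs * cells_per_output


ports, chips = clos_dimensions(32)
print(ports, chips)            # 1024 96

# Single-stage comparison for an assumed N = 32 ports:
N = 32
print(buffer_cells(N, 12))     # 384 cells: 12 cells of shared buffer per output
print(buffer_cells(N, N))      # 1024 cells: buffered crossbar, N cells per output
```

Under this assumption the shared-buffer switch uses 384 cells against 1024 for the buffered crossbar, and the gap widens as N grows, since one total scales with N and the other with N².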
Our scheme is well suited to heavy congestion; in systems that need to provide very low latency under light (uncongested) traffic, one would apply this scheme only when the load exceeds a given threshold. Lastly, we consider some blocking network topologies, such as the banyan. In a banyan network, internal links can cause congestion in addition to output ports. We present a fully distributed scheduler for this network that eliminates congestion from both internal and output-port links.
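The core idea of reserving buffer space rather than link time slots can be sketched for a single output as follows. This is a minimal illustration, not the scheduler developed in the thesis: the class `OutputCreditScheduler`, the FIFO grant order, and the zero-delay model in `simulate` (granted cells reach the buffer in the same slot) are all assumptions introduced here.

```python
from collections import deque


class OutputCreditScheduler:
    """Single-resource scheduler for one output (hypothetical sketch).

    A grant is issued only while unreserved space remains in the output
    buffer, so granted traffic can never overflow it.
    """

    def __init__(self, buffer_cells):
        self.credits = buffer_cells   # unreserved cells in the output buffer
        self.pending = deque()        # FIFO order is an assumption, not the thesis policy

    def request(self, input_id):
        self.pending.append(input_id)

    def grant(self):
        """Reserve one buffer cell for the oldest request, or return None."""
        if self.credits == 0 or not self.pending:
            return None
        self.credits -= 1
        return self.pending.popleft()

    def cell_departed(self):
        """A cell left the buffer; its space may be reserved again."""
        self.credits += 1


def simulate(num_inputs, buffer_cells, slots):
    """All inputs request the same output; the output drains one cell per slot."""
    sched = OutputCreditScheduler(buffer_cells)
    for i in range(num_inputs):
        sched.request(i)
    buffered, delivered, max_occupancy = deque(), [], 0
    for _ in range(slots):
        if buffered:                      # output link sends one cell
            delivered.append(buffered.popleft())
            sched.cell_departed()
        granted = sched.grant()
        while granted is not None:        # grant while buffer space is unreserved
            buffered.append(granted)      # assumed: the cell arrives immediately
            granted = sched.grant()
        max_occupancy = max(max_occupancy, len(buffered))
    return delivered, max_occupancy


delivered, max_occupancy = simulate(num_inputs=4, buffer_cells=2, slots=5)
print(delivered, max_occupancy)           # [0, 1, 2, 3] 2
```

With four inputs contending for a two-cell buffer, every cell is delivered and the buffer never holds more than two cells. Excess demand waits at the inputs as ungranted requests instead of filling the fabric.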
