Congestion Control for Large-Scale RDMA Deployments

Modern datacenter applications demand high throughput (40 Gbps) and ultra-low latency (< 10 μs per hop) from the network, with low CPU overhead. Standard TCP/IP stacks cannot meet these requirements, but Remote Direct Memory Access (RDMA) can. On IP-routed datacenter networks, RDMA is deployed using the RoCEv2 protocol, which relies on Priority-based Flow Control (PFC) to enable a drop-free network. However, PFC can lead to poor application performance due to problems such as head-of-line blocking and unfairness. To alleviate these problems, we introduce DCQCN, an end-to-end congestion control scheme for RoCEv2. To optimize DCQCN performance, we build a fluid model and provide guidelines for tuning switch buffer thresholds and other protocol parameters. Using a 3-tier Clos network testbed, we show that DCQCN dramatically improves the throughput and fairness of RoCEv2 RDMA traffic. DCQCN is implemented in Mellanox NICs and is being deployed in Microsoft's datacenters.
