Swift: Delay is Simple and Effective for Congestion Control in the Datacenter

We report on experiences with Swift congestion control in Google datacenters. Swift targets an end-to-end delay by using AIMD control, with pacing under extreme congestion. With accurate RTT measurement and care in reasoning about delay targets, we find this design is a foundation for excellent performance when network distances are well-known. Importantly, its simplicity helps us to meet operational challenges. Delay is easy to decompose into fabric and host components to separate concerns, and effortless to deploy and maintain as a congestion signal while the datacenter evolves. In large-scale testbed experiments, Swift delivers a tail latency of <50μs for short RPCs, with near-zero packet drops, while sustaining ~100 Gbps throughput per server. This tail is less than 3x the minimal latency at loads close to 100%. In production use in many different clusters, Swift achieves consistently low tail completion times for short RPCs while providing high throughput for long RPCs. Its loss rates are at least 10x lower than those of a DCTCP-based protocol, and it handles O(10k) incasts that sharply degrade performance under DCTCP.
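To make the control law described above concrete, the sketch below shows a delay-targeting AIMD window update with pacing once the window drops below one packet, in the spirit of Swift. It is a minimal illustration, not the paper's implementation: the constants (AI, BETA, MAX_MD, window bounds) and the exact clamping are assumptions chosen for readability, and the real system additionally splits the target delay into fabric and host components.

```python
# Minimal sketch of a Swift-style delay-targeting AIMD update.
# All constants below are illustrative assumptions, not the paper's values.

AI = 1.0          # additive-increase step, in segments per RTT (assumed)
BETA = 0.8        # multiplicative-decrease scaling factor (assumed)
MAX_MD = 0.5      # cap on the decrease applied in a single update (assumed)
MIN_CWND = 0.001  # fractional windows are allowed; pacing takes over below 1
MAX_CWND = 10_000


def on_ack(cwnd: float, num_acked: int, rtt_us: float, target_us: float) -> float:
    """Return the new congestion window after an ACK covering num_acked segments."""
    if rtt_us < target_us:
        # Below the delay target: additive increase of roughly AI segments
        # per RTT, spread across the ACKs received in that RTT.
        cwnd += (AI / cwnd) * num_acked
    else:
        # Above the delay target: multiplicative decrease scaled by how far
        # the measured delay overshoots the target, capped at MAX_MD.
        overshoot = (rtt_us - target_us) / rtt_us
        cwnd *= max(1.0 - BETA * overshoot, 1.0 - MAX_MD)
    return min(max(cwnd, MIN_CWND), MAX_CWND)


def pacing_delay_us(cwnd: float, rtt_us: float) -> float:
    """Under extreme congestion the window can fall below one packet.
    Rather than stalling, send one packet every rtt/cwnd microseconds."""
    return rtt_us / cwnd if cwnd < 1.0 else 0.0
```

For example, with a 50μs target and a measured RTT of 100μs, the overshoot is 0.5, so the window shrinks by the capped factor of 0.5 in one update; if that leaves cwnd at 0.25, the sender paces one packet every 400μs instead of keeping a sub-packet window idle.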
