Towards timeout-less transport in commodity datacenter networks

Despite recent advances in datacenter networks, timeouts caused by congestion-induced packet losses remain a major cause of high tail latency. Priority-based Flow Control (PFC) was introduced to make the network lossless, but its head-of-line blocking causes various performance and management problems. In this paper, we ask whether it is possible to design a network that achieves (near-)zero timeouts using only commodity datacenter hardware. Our answer is TLT, an extension to existing transports designed to eliminate timeouts. We are inspired by the observation that only certain types of packet drops cause timeouts. Therefore, instead of dropping packets blindly (TCP) or not dropping them at all (RoCEv2), TLT proactively drops less important packets to ensure the delivery of more important ones, whose loss may cause timeouts. It classifies packets at the host and leverages color-aware thresholding, a feature widely supported by commodity switches, to perform these proactive drops. We implement TLT prototypes using VMA to test with real applications. Our testbed evaluation on Redis shows that TLT reduces the 99th-percentile flow completion time (FCT) by up to 91.7% when handling bursts of SET operations. In large-scale simulations, TLT augments diverse datacenter transports, from widely used ones (TCP, DCTCP, DCQCN) to state-of-the-art ones (IRN and HPCC), achieving up to 81% lower tail latency.
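To make the idea of color-aware thresholding concrete, the following is a minimal sketch of a switch queue that admits important packets up to the full buffer but drops unimportant ones once the queue depth reaches a lower threshold. It is an illustration under stated assumptions, not TLT's implementation: the names `Packet`, `ColorAwareQueue`, and `classify_burst`, the parameters `capacity` and `red_threshold`, and the rule that only the last packet of a burst is marked important are all hypothetical and chosen only for this example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Packet:
    seq: int
    important: bool  # color assigned at the host (hypothetical field)


@dataclass
class ColorAwareQueue:
    """Toy model of one switch queue with a per-color admission threshold."""
    capacity: int       # total buffer size, in packets (assumed unit)
    red_threshold: int  # depth beyond which unimportant packets are dropped
    queue: deque = field(default_factory=deque)
    dropped: list = field(default_factory=list)

    def enqueue(self, pkt: Packet) -> bool:
        # Important packets may use the whole buffer; unimportant packets
        # are admitted only while the queue is below the lower threshold.
        limit = self.capacity if pkt.important else self.red_threshold
        if len(self.queue) >= limit:
            self.dropped.append(pkt)
            return False
        self.queue.append(pkt)
        return True


def classify_burst(n: int) -> list:
    # Assumption for illustration only: the last packet of a burst is the
    # one whose loss would cause a timeout, so only it is marked important.
    return [Packet(seq=i, important=(i == n - 1)) for i in range(n)]


if __name__ == "__main__":
    q = ColorAwareQueue(capacity=8, red_threshold=4)
    for pkt in classify_burst(10):  # burst arrives with no draining
        q.enqueue(pkt)
    print("queued :", [p.seq for p in q.queue])
    print("dropped:", [p.seq for p in q.dropped])
```

In this toy run, the unimportant packets that arrive after the threshold is reached are dropped, which leaves headroom so that the important tail packet is still admitted. A real switch would also drain the queue and would typically configure such thresholds per port or per queue; those details are omitted here.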
