One More Config is Enough: Saving (DC)TCP for High-Speed Extremely Shallow-Buffered Datacenters

The link speed in production datacenters is growing fast, from 1 Gbps to 40 Gbps or even 100 Gbps. However, the buffer size of commodity switches is growing much more slowly, e.g., from 4 MB at 1 Gbps to 16 MB at 100 Gbps, and is thus significantly outpaced by the link speed. In such extremely shallow-buffered networks, today’s TCP/ECN solutions, such as DCTCP, suffer from either excessive packet losses or significant throughput degradation. Motivated by this, we introduce BCC,¹ a simple yet effective solution that requires only one more ECN configuration (i.e., shared buffer ECN/RED) at commodity switches. BCC operates on real-time global shared buffer utilization. When the available buffer space suffices, BCC delivers both high throughput and a low packet loss rate, as prior work does; when it becomes insufficient, BCC automatically triggers the shared buffer ECN to prevent packet loss at the cost of a small amount of throughput. BCC is readily deployable with existing commodity switches. We validate BCC’s efficacy in a 100G testbed and evaluate its performance using extensive simulations. Our results show that BCC persistently maintains a low packet loss rate while only slightly degrading throughput when the buffer becomes insufficient. For example, compared to current practice, BCC achieves up to 94.4% lower 99th percentile flow completion time (FCT) for small flows while degrading average FCT for large flows by at most 3%.

¹ BCC: Buffer-aware Active Queue Management (AQM) scheme for Congestion Control in extremely shallow-buffered datacenters.
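The marking behavior described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a packet is marked when either the usual per-port ECN rule or the additional shared buffer rule fires, and it replaces the probabilistic RED curve with a single step threshold. The names `SwitchState` and `should_mark_ecn`, and all threshold values, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SwitchState:
    """Hypothetical snapshot of a shared-buffer switch (sizes in bytes)."""
    shared_buffer_used: int                        # occupancy summed over all ports
    queue_len: dict = field(default_factory=dict)  # per-port egress queue occupancy


def should_mark_ecn(state, port, port_threshold, shared_threshold):
    """Return True if a packet arriving for `port` should be ECN-marked.

    Assumption: two independent step-threshold rules, either of which marks.
      1. Per-port rule: the egress queue exceeds its own threshold
         (the usual DCTCP-style configuration).
      2. Shared-buffer rule: total occupancy exceeds a global threshold
         (the one extra configuration the abstract refers to).
    """
    per_port_congested = state.queue_len.get(port, 0) > port_threshold
    shared_buffer_congested = state.shared_buffer_used > shared_threshold
    return per_port_congested or shared_buffer_congested


if __name__ == "__main__":
    # Hypothetical thresholds: 700 KB per port, 12 MB for the shared buffer.
    plenty = SwitchState(shared_buffer_used=1_000_000,
                         queue_len={0: 100_000, 1: 900_000})
    scarce = SwitchState(shared_buffer_used=13_000_000,
                         queue_len={0: 100_000})
    print(should_mark_ecn(plenty, 0, 700_000, 12_000_000))
    print(should_mark_ecn(plenty, 1, 700_000, 12_000_000))
    print(should_mark_ecn(scarce, 0, 700_000, 12_000_000))
```

While the shared buffer has headroom, only the per-port rule is active, so behavior matches a standard per-port ECN setup. Once total occupancy crosses the global threshold, even packets on lightly loaded ports are marked, which is the throughput-for-loss trade-off the abstract describes.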


[15]  Kai Chen,et al.  One More Config is Enough: Saving (DC)TCP for High-speed Extremely Shallow-buffered Datacenters , 2020, IEEE INFOCOM 2020 - IEEE Conference on Computer Communications.
