Dart: Divide and Specialize for Fast Response to Congestion in RDMA-Based Datacenter Networks

Though Remote Direct Memory Access (RDMA) promises to reduce datacenter network latencies significantly compared to TCP (e.g., 10$\times$), end-to-end congestion control in the presence of incasts remains a challenge. Targeting the full generality of the congestion problem, previous schemes rely on slow, iterative convergence to the appropriate sending rates (e.g., TIMELY takes 50 RTTs). Several papers have shown that, even in oversubscribed datacenter networks, most congestion occurs at the receiver. Accordingly, we propose a divide-and-specialize approach, called Dart, which isolates the common case of receiver congestion and further subdivides the remaining in-network congestion into the simpler spatially-localized case and the harder spatially-dispersed case. For receiver congestion, we propose direct apportioning of sending rates (DASR), in which a receiver for $n$ senders directs each sender to cut its rate by a factor of $n$, converging in only one RTT. For the spatially-localized case, Dart provides fast (under one RTT) response by adding novel switch hardware for in-order flow deflection (IOFD), because RDMA disallows the packet reordering on which previous load-balancing schemes rely. For the uncommon spatially-dispersed case, Dart falls back to DCQCN. Small-scale testbed measurements show that Dart achieves 60% (2.5$\times$) lower $99^{th}$-percentile latency than, and similar throughput to, InfiniBand; at-scale simulations show 79% (4.8$\times$) lower $99^{th}$-percentile latency and 58% higher throughput than TIMELY and DCQCN.
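The DASR idea described in the abstract, a receiver telling each of its $n$ senders to cut its rate by a factor of $n$, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `dasr_rates`, the use of sender identifiers as strings, and the assumption that every sender starts at the receiver's full line rate are all hypothetical choices made for illustration.

```python
def dasr_rates(line_rate_gbps: float, senders: list[str]) -> dict[str, float]:
    """Hypothetical sketch of receiver-side rate apportioning.

    Assumes each sender would otherwise transmit at the receiver's full
    line rate, so cutting by a factor of n gives each sender an equal
    1/n share of the receiver link.
    """
    n = len(senders)
    if n == 0:
        return {}  # no active senders, nothing to apportion
    share = line_rate_gbps / n  # each sender cuts its rate by a factor of n
    return {sender: share for sender in senders}


# Example: four senders converge on a 100 Gbps receiver link (an incast).
rates = dasr_rates(100.0, ["s1", "s2", "s3", "s4"])
```

Because the receiver sees all of its senders directly, it can compute the shares in a single step rather than probing iteratively, which is the intuition behind the one-RTT convergence claimed above.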
