Multi-Path Transport for RDMA in Datacenters

RDMA is becoming prevalent because of its low latency, high throughput, and low CPU overhead. However, current RDMA remains a single-path transport that is prone to failures and fails to utilize the rich parallel paths in datacenters. Unlike previous multi-path approaches, which mainly focus on TCP, this paper presents a multi-path transport for RDMA, MP-RDMA, which efficiently utilizes the rich network paths in datacenters. MP-RDMA employs three novel techniques to address the challenge of limited on-chip memory in RDMA NICs: 1) a multi-path ACK-clocking mechanism that distributes traffic in a congestion-aware manner without incurring per-path state; 2) an out-of-order-aware path selection mechanism that controls the level of out-of-order delivery, thus minimizing the metadata required to track out-of-order packets; 3) a synchronization mechanism that ensures in-order memory updates whenever needed. With all these techniques, MP-RDMA adds only 66B to each connection state compared to single-path RDMA. Our evaluation with an FPGA-based prototype demonstrates that, compared with single-path RDMA, MP-RDMA significantly improves robustness under failures (2x∼4x higher throughput under 0.5%∼10% link loss ratio) and improves overall network utilization by up to 47%.
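The abstract does not spell out how out-of-order packets are tracked, so the following is only a minimal sketch of why bounding the out-of-order degree bounds per-connection metadata. It assumes a receiver that records arrivals in a fixed-size bitmap; the class name `OooTracker`, the `window` parameter, and the use of packet sequence numbers (PSNs) as indices are all hypothetical and not taken from the paper.

```python
class OooTracker:
    """Hypothetical receiver-side tracker; not MP-RDMA's actual design.

    State is one integer bitmap of `window` bits plus one counter, so the
    metadata size is fixed regardless of how many paths carry the flow.
    """

    def __init__(self, window: int):
        self.window = window   # maximum out-of-order degree that can be tracked
        self.expected = 0      # next in-order PSN
        self.bitmap = 0        # bit i set => PSN (expected + i) has arrived

    def receive(self, psn: int) -> bool:
        """Record a packet; return False if it cannot be tracked."""
        offset = psn - self.expected
        if offset < 0 or offset >= self.window:
            # Duplicate of an already-delivered packet, or too far ahead
            # of the in-order point to fit in the bitmap.
            return False
        self.bitmap |= 1 << offset
        # Slide the window past every contiguous packet already received.
        while self.bitmap & 1:
            self.bitmap >>= 1
            self.expected += 1
        return True


tracker = OooTracker(window=4)
tracker.receive(1)   # arrives early, held in the bitmap
tracker.receive(0)   # fills the gap, in-order point advances past both
```

In this sketch a packet that lands outside the window cannot be recorded at all, which is why a sender-side mechanism that limits how far packets on different paths can drift apart lets the bitmap stay small.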
