Accelerating Distributed Deep Learning using Multi-Path RDMA in Data Center Networks

Data center networks (DCNs) have widely deployed RDMA to support data-intensive applications such as machine learning. While DCNs are designed with rich multi-path topologies, current RDMA hardware does not support multi-path transport. In this paper we present Maestro, a purely software-based multi-path RDMA solution that uses this rich multi-path topology for load balancing and reliability. As a "middleware" operating in user space, Maestro is modular and software-defined: it decouples path selection and load balancing mechanisms from hardware features, and allows DCN operators and applications to make flexible decisions by employing the mechanisms best suited to their needs. As such, Maestro can be readily deployed on existing RDMA hardware (NICs) to support distributed deep learning (DDL) applications. Our experiments show that Maestro is capable of fully utilizing multiple paths with negligible CPU overhead, thereby enhancing the performance of DDL applications.
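
To make the idea of decoupling path selection from the transport concrete, the following is a minimal sketch, not Maestro's actual implementation. It assumes a message is split into fixed-size chunks and that a pluggable policy decides which path carries each chunk. The names `Path`, `make_round_robin`, `least_loaded`, and `stripe` are hypothetical and introduced only for illustration, and no real RDMA calls are made.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Path:
    """One network path (hypothetical: e.g. one connection pinned to a route)."""
    name: str
    queued_bytes: int = 0  # bytes assigned to this path so far


# A policy maps the candidate paths to the index of the path to use next.
Policy = Callable[[Sequence[Path]], int]


def make_round_robin() -> Policy:
    """Return a policy that cycles through the paths in order."""
    state = {"next": 0}

    def pick(paths: Sequence[Path]) -> int:
        index = state["next"] % len(paths)
        state["next"] += 1
        return index

    return pick


def least_loaded(paths: Sequence[Path]) -> int:
    """Pick the path with the fewest assigned bytes (ties go to the first)."""
    return min(range(len(paths)), key=lambda i: paths[i].queued_bytes)


def stripe(message_len: int, chunk_size: int,
           paths: List[Path], policy: Policy) -> List[Tuple[int, int, str]]:
    """Split a message into chunks and assign each chunk to a path.

    Returns a plan of (offset, length, path_name) tuples. The transport
    that would actually send each chunk is deliberately left out.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not paths:
        raise ValueError("at least one path is required")

    plan = []
    offset = 0
    while offset < message_len:
        length = min(chunk_size, message_len - offset)
        path = paths[policy(paths)]
        path.queued_bytes += length  # account for the newly assigned chunk
        plan.append((offset, length, path.name))
        offset += length
    return plan


if __name__ == "__main__":
    demo_paths = [Path("p0"), Path("p1")]
    for entry in stripe(10, 4, demo_paths, make_round_robin()):
        print(entry)
```

Because the policy is just a function argument, swapping `make_round_robin()` for `least_loaded` changes the load balancing behavior without touching the striping logic, which is the kind of modularity the abstract describes.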
