Paper information - Accelerating Distributed Deep Learning using Multi-Path RDMA in Data Center Networks

Data center networks (DCNs) have widely deployed RDMA to support data-intensive applications such as machine learning. While DCNs are designed with rich multi-path topologies, current RDMA (hardware) technology does not support multi-path transport. In this paper we advance Maestro, a purely software-based multi-path RDMA solution, to effectively utilize the rich multi-path topology for load balancing and reliability. As a "middleware" operating in user space, Maestro is modular and software-defined: it decouples path selection and load balancing mechanisms from hardware features, and allows DCN operators and applications to make flexible decisions by employing the best mechanisms as needed. As such, Maestro can be readily deployed using existing RDMA hardware (NICs) to support distributed deep learning (DDL) applications. Our experiments show that Maestro is capable of fully utilizing multiple paths with negligible CPU overhead, thereby enhancing the performance of DDL applications.
