Improving the Performance of Distributed MXNet with RDMA

As one of the most influential deep learning frameworks, MXNet has delivered excellent performance and enabled many breakthroughs in academia and industry across a wide range of machine learning workloads. The original implementation of MXNet communicates through a proxy-socket interface, which delivers suboptimal performance in distributed environments. In a massively parallel training task, parameters are updated frequently in every training iteration, so network performance becomes the dominant factor in overall performance. Over the past decade, high-performance interconnects have employed remote direct memory access (RDMA) technology to provide excellent performance for numerous scientific domains. In this paper, we describe an efficient design that extends the open-source MXNet to make it RDMA-capable via RDMA-based parameter-server interfaces. With modest optimizations to memory usage and transmission overhead, RDMA-based MXNet achieves substantial performance improvements over the original software. Our experiments reveal that, for the communication subsystem of MXNet, the new design achieves a 16x speedup (up to 21x at peak) over 1 Gigabit Ethernet (1GigE). For the two training cases evaluated on MXNet, the optimized implementation gains 5x and 9x speedups, respectively. Compared with the IP-over-InfiniBand (IPoIB) protocol, it achieves nearly 30% performance improvement, along with better scalability and fewer communication bottlenecks.
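
Although the paper's source code is not reproduced here, the zero-copy pattern an RDMA-based parameter server builds on can be sketched with the standard libibverbs API. The sketch below registers a parameter buffer with the NIC and prepares a one-sided RDMA_WRITE work request; the buffer size, the placeholder remote_addr/rkey values, and the omitted queue-pair setup and out-of-band key exchange are all illustrative assumptions, not the authors' implementation.

    // Minimal sketch (not the paper's implementation): register a parameter
    // buffer with the RDMA NIC and prepare a zero-copy RDMA_WRITE request,
    // the core pattern behind RDMA-based parameter exchange.
    // Build: g++ rdma_sketch.cpp -o rdma_sketch -libverbs
    #include <infiniband/verbs.h>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    int main() {
      int num_devices = 0;
      ibv_device** devs = ibv_get_device_list(&num_devices);
      if (!devs || num_devices == 0) {
        fprintf(stderr, "no RDMA device found\n");
        return 1;
      }
      ibv_context* ctx = ibv_open_device(devs[0]);
      ibv_pd* pd = ibv_alloc_pd(ctx);

      // Parameter buffer: registered once and reused across iterations, so
      // the NIC reads it directly (zero copy) instead of staging through
      // sockets. The 1M-float size is a hypothetical weight/gradient block.
      std::vector<float> params(1 << 20, 0.0f);
      ibv_mr* mr = ibv_reg_mr(pd, params.data(), params.size() * sizeof(float),
                              IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                              IBV_ACCESS_REMOTE_WRITE);
      printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
             params.size() * sizeof(float), mr->lkey, mr->rkey);

      // Scatter/gather element describing the local source buffer.
      ibv_sge sge;
      memset(&sge, 0, sizeof(sge));
      sge.addr   = reinterpret_cast<uintptr_t>(params.data());
      sge.length = static_cast<uint32_t>(params.size() * sizeof(float));
      sge.lkey   = mr->lkey;

      // One-sided RDMA write: the remote CPU is not involved in the transfer.
      // remote_addr and rkey would be advertised by the server during
      // connection setup; they are left as placeholders here.
      ibv_send_wr wr;
      memset(&wr, 0, sizeof(wr));
      wr.wr_id      = 42;                 // app-chosen id, echoed in completion
      wr.sg_list    = &sge;
      wr.num_sge    = 1;
      wr.opcode     = IBV_WR_RDMA_WRITE;
      wr.send_flags = IBV_SEND_SIGNALED;  // request a completion-queue entry
      wr.wr.rdma.remote_addr = 0;         // placeholder: server-advertised addr
      wr.wr.rdma.rkey        = 0;         // placeholder: server-advertised rkey
      // With a connected queue pair `qp` (setup omitted), the request would
      // be posted as:
      //   ibv_send_wr* bad = nullptr;
      //   ibv_post_send(qp, &wr, &bad);

      ibv_dereg_mr(mr);
      ibv_dealloc_pd(pd);
      ibv_close_device(ctx);
      ibv_free_device_list(devs);
      return 0;
    }

Registering the buffer once and reusing it across training iterations avoids per-message copies and repeated registration cost, and a one-sided write keeps the server's CPU out of the data path; both are consistent with the memory-usage and transmission-overhead optimizations the abstract describes.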
