Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

Scaling out deep neural network (DNN) training is essential for reducing model training time, but high communication overhead is one of the major performance bottlenecks of distributed DNN training across multiple GPUs. Our investigation shows that popular open-source DNN systems achieve only a 2.5× speedup on 64 GPUs connected by a 56 Gbps network. To address this problem, we propose GradientFlow, a communication backend for distributed DNN training, together with a set of network optimization techniques. First, we integrate ring-based allreduce, mixed-precision training, and computation/communication overlap into GradientFlow. Second, we propose lazy allreduce, which improves network throughput by fusing multiple communication operations into a single one, and design coarse-grained sparse communication, which reduces network traffic by transmitting only the important gradient chunks. When training ImageNet/AlexNet on 512 GPUs, our approach achieves a 410.2× speedup and completes 95-epoch training in 1.5 minutes, outperforming existing approaches.
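To make the lazy-allreduce idea concrete, the sketch below shows one way gradient fusion could be implemented: per-layer gradients are buffered as they become ready and reduced in a single collective call once a size threshold is reached. This is a minimal sketch, assuming PyTorch's torch.distributed API with an NCCL backend and a hypothetical 32 MB fusion threshold; it is not the GradientFlow implementation.

```python
# Sketch of lazy allreduce: fuse many small per-layer allreduce calls into
# one large collective to improve network throughput.
# Assumptions (not from the paper): torch.distributed is already initialized,
# gradients share one dtype, and 32 MB is the fusion threshold.

import torch
import torch.distributed as dist

FUSION_THRESHOLD_BYTES = 32 * 1024 * 1024  # hypothetical threshold


class LazyAllreduce:
    def __init__(self):
        self.pending = []        # gradient tensors waiting to be fused
        self.pending_bytes = 0

    def add_gradient(self, grad: torch.Tensor):
        """Called as each layer's gradient becomes ready during backprop."""
        self.pending.append(grad)
        self.pending_bytes += grad.numel() * grad.element_size()
        if self.pending_bytes >= FUSION_THRESHOLD_BYTES:
            self.flush()

    def flush(self):
        """Fuse pending gradients into one buffer and allreduce once.

        Also called once at the end of the backward pass so that any
        remaining gradients below the threshold are still reduced.
        """
        if not self.pending:
            return
        flat = torch.cat([g.reshape(-1) for g in self.pending])
        dist.all_reduce(flat, op=dist.ReduceOp.SUM)  # ring allreduce under NCCL
        flat /= dist.get_world_size()                # average across workers
        # Scatter the averaged values back into the original gradient tensors.
        offset = 0
        for g in self.pending:
            n = g.numel()
            g.copy_(flat[offset:offset + n].view_as(g))
            offset += n
        self.pending = []
        self.pending_bytes = 0
```

Coarse-grained sparse communication could be layered on the same buffering scheme by ranking the fused chunks (for example, by gradient magnitude), transmitting only the highest-ranked chunks, and accumulating the rest locally for later rounds; the selection policy sketched here is illustrative rather than the paper's exact criterion.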
