Iteration number-based hierarchical gradient aggregation for distributed deep learning

Distributed deep learning accelerates neural network training by employing multiple workers across a cluster of nodes to train a model in parallel. In this paper, we propose InHAD, an asynchronous distributed deep learning protocol whose key novelty lies in its hierarchical gradient communication and aggregation. Local aggregation is conducted inside each computing node to combine the gradients produced by local workers, while global aggregation is conducted at the parameter server to compute new model parameters from the results of the local aggregations. An iteration number (IN)-based mechanism is designed to guarantee convergence of training: worker nodes keep sending their INs to the parameter server, which counts them to decide when to pull gradients from the workers and update the global model parameters. With the IN-based hierarchical aggregation technique, InHAD reduces communication cost by cutting the number of gradients transferred and speeds up convergence by limiting gradient staleness. We conduct extensive experiments on the Tianhe-2 supercomputer to evaluate the performance of InHAD. Two neural networks are trained on two classical datasets, and similar protocols, including Horovod and ASP, are tested for comparison. The results show that InHAD achieves much higher acceleration than ASP and nearly the same accuracy as Horovod.
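The abstract only sketches the protocol at a high level. As a rough, hypothetical illustration of the idea, the Python snippet below shows one way a per-node local aggregator and an IN-counting parameter server could interact. All names here (LocalAggregator, ParameterServer, report_iteration), the pull_threshold knob, and the plain SGD-style update are assumptions made for this sketch, not the actual InHAD implementation.

```python
# Minimal sketch of IN-based hierarchical aggregation (hypothetical names;
# the real InHAD protocol is defined in the paper, not reproduced here).
import numpy as np


class LocalAggregator:
    """Runs inside one computing node: collects gradients from local workers."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.buffer = []

    def push(self, grad):
        self.buffer.append(grad)

    def aggregate(self):
        # Local aggregation: average the gradients produced by local workers.
        agg = np.mean(self.buffer, axis=0)
        self.buffer.clear()
        return agg


class ParameterServer:
    """Counts iteration numbers (IN) reported by nodes and pulls the locally
    aggregated gradient once enough iterations have been observed."""

    def __init__(self, init_weights, lr=0.01, pull_threshold=4):
        self.weights = init_weights
        self.lr = lr
        self.pull_threshold = pull_threshold  # assumed tunable knob
        self.in_counter = 0

    def report_iteration(self, node):
        # Each worker node keeps sending its iteration number (IN).
        self.in_counter += 1
        if self.in_counter >= self.pull_threshold:
            self.global_update(node.aggregate())
            self.in_counter = 0

    def global_update(self, local_agg):
        # Global aggregation: update model parameters with the pulled gradient.
        self.weights -= self.lr * local_agg


# Toy usage: one node with two local workers producing random gradients.
if __name__ == "__main__":
    ps = ParameterServer(init_weights=np.zeros(8), pull_threshold=2)
    node = LocalAggregator(num_workers=2)
    for step in range(4):
        for _ in range(node.num_workers):
            node.push(np.random.randn(8))  # stand-in for a worker's gradient
        ps.report_iteration(node)
    print(ps.weights)
```

The counting step is what the abstract attributes the convergence benefit to: the server applies a global update only after a bounded number of iteration reports, which limits how stale the aggregated gradients can be when they reach the model.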
