Paper information - Combining Global Sparse Gradients with Local Gradients in Distributed Neural Network Training

One way to reduce network traffic in multi-node data-parallel stochastic gradient descent is to only exchange the largest gradients. However, doing so damages the gradient and degrades the model’s performance. Transformer models degrade dramatically while the impact on RNNs is smaller. We restore gradient quality by combining the compressed global gradient with the node’s locally computed uncompressed gradient. Neural machine translation experiments show that Transformer convergence is restored while RNNs converge faster. With our method, training on 4 nodes converges up to 1.5x as fast as with uncompressed gradients and scales 3.5x relative to single-node training.
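The idea in the abstract can be illustrated with a minimal sketch. It assumes top-k selection by magnitude for the compression step and a simple summation across nodes for the exchange. The abstract does not state how the global and local gradients are combined, so the `combine` rule below (replace the node's own sparse contribution with its full local gradient) is a hypothetical choice for illustration, not necessarily the authors' method. All function names are invented for this example.

```python
from typing import List


def top_k_sparsify(grad: List[float], k: int) -> List[float]:
    """Keep the k entries with the largest magnitude and zero out the rest."""
    if k <= 0:
        return [0.0] * len(grad)
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


def global_sparse_gradient(local_grads: List[List[float]], k: int) -> List[float]:
    """Sum the sparsified gradients of all nodes (the part sent over the network)."""
    sparse = [top_k_sparsify(g, k) for g in local_grads]
    return [sum(column) for column in zip(*sparse)]


def combine(global_sparse: List[float], local_grad: List[float], k: int) -> List[float]:
    """Hypothetical combination rule (an assumption, not taken from the abstract):
    remove this node's own sparse contribution from the global gradient and add
    its dense, uncompressed local gradient instead."""
    own_sparse = top_k_sparsify(local_grad, k)
    return [g - s + l for g, s, l in zip(global_sparse, own_sparse, local_grad)]


# Two simulated nodes, each holding a 4-element local gradient.
node_grads = [
    [0.5, -0.25, 2.0, 0.0],
    [-1.0, 0.125, 0.25, 3.0],
]

global_grad = global_sparse_gradient(node_grads, k=1)
update_node0 = combine(global_grad, node_grads[0], k=1)

print(global_grad)   # [0.0, 0.0, 2.0, 3.0]
print(update_node0)  # [0.5, -0.25, 2.0, 3.0]
```

In this toy run each node transmits only one value, yet node 0 still applies the small components of its own gradient (0.5 and -0.25) that the compression step discarded. Each node ends up with a slightly different update under this rule, which is a design trade-off of combining local information with the shared sparse gradient.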

Kenneth Heafield | Alham Fikri Aji | Nikolay Bogoychev
