Understanding Top-k Sparsification in Distributed Deep Learning

Distributed stochastic gradient descent (SGD) algorithms are widely deployed to train large-scale deep learning models, but the communication overhead among workers has become the new system bottleneck. Recently proposed gradient sparsification techniques, especially Top-$k$ sparsification with error compensation (TopK-SGD), can significantly reduce the communication traffic without a noticeable impact on model accuracy. Several theoretical studies have analyzed the convergence properties of TopK-SGD. However, existing studies do not examine the details of the Top-$k$ operator in gradient sparsification and instead rely on relaxed bounds (e.g., the exact bound of Random-$k$) for analysis; hence the derived results cannot accurately describe the real convergence behavior of TopK-SGD. To this end, we first study the gradient distributions of TopK-SGD during training through extensive experiments. We then theoretically derive a tighter bound for the Top-$k$ operator. Finally, we exploit the observed gradient distribution to propose an approximate top-$k$ selection algorithm, which is computationally efficient on GPUs, to improve the scaling efficiency of TopK-SGD by significantly reducing the computation overhead. Code is available at: this https URL.
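To make the mechanism concrete, below is a minimal single-worker sketch of Top-$k$ sparsification with error compensation (error feedback), assuming PyTorch; the names `TopKSGDWorker` and `topk_compress` are illustrative and are not taken from the released code.

```python
# Minimal sketch of Top-k sparsification with error compensation on one
# worker (assumed PyTorch; names are illustrative, not the paper's API).
import torch


def topk_compress(flat_grad: torch.Tensor, k: int):
    """Keep the k largest-magnitude elements of a flattened gradient."""
    _, idx = torch.topk(flat_grad.abs(), k)   # indices of top-k by magnitude
    values = flat_grad[idx]
    return values, idx


class TopKSGDWorker:
    """Per-worker state: the accumulated residual used for error feedback."""

    def __init__(self, num_params: int, k: int):
        self.residual = torch.zeros(num_params)  # gradient mass not yet sent
        self.k = k

    def step(self, grad: torch.Tensor):
        # 1) Add back the error accumulated from previous iterations.
        corrected = grad.flatten() + self.residual
        # 2) Sparsify: keep only the k largest-magnitude coordinates.
        values, idx = topk_compress(corrected, self.k)
        # 3) Store everything that was NOT sent as the new residual.
        sparse = torch.zeros_like(corrected)
        sparse[idx] = values
        self.residual = corrected - sparse
        # 4) In distributed training, (values, idx) would be exchanged
        #    (e.g., all-gathered) among workers and applied to the model.
        return values, idx
```

In this sketch the residual retains the coordinates dropped by the Top-$k$ operator, so they are eventually transmitted in later iterations; this is the error-compensation mechanism whose convergence the paper analyzes.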
