Understanding Top-k Sparsification in Distributed Deep Learning

Distributed stochastic gradient descent (SGD) algorithms are widely deployed to train large-scale deep learning models, but the communication overhead among workers has become the new system bottleneck. Recently proposed gradient sparsification techniques, especially Top-$k$ sparsification with error compensation (TopK-SGD), can significantly reduce the communication traffic without a noticeable impact on model accuracy. Several theoretical studies have analyzed the convergence properties of TopK-SGD. However, existing studies do not examine the details of the Top-$k$ operator in gradient sparsification and instead rely on relaxed bounds (e.g., the exact bound of Random-$k$) for analysis; hence the derived results cannot accurately describe the real convergence behavior of TopK-SGD. To this end, we first study the gradient distributions of TopK-SGD during training through extensive experiments. We then theoretically derive a tighter bound for the Top-$k$ operator. Finally, we exploit the observed gradient distribution to propose an approximate top-$k$ selection algorithm, which is computationally efficient on GPUs, to improve the scaling efficiency of TopK-SGD by significantly reducing the computation overhead. Code is available at: this https URL.
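To make the mechanism concrete, below is a minimal single-worker sketch of Top-$k$ sparsification with error compensation (error feedback), assuming PyTorch; the names `TopKSGDWorker` and `topk_compress` are illustrative and are not taken from the released code.

```python
# Minimal sketch of Top-k sparsification with error compensation on one
# worker (assumed PyTorch; names are illustrative, not the paper's API).
import torch


def topk_compress(flat_grad: torch.Tensor, k: int):
    """Keep the k largest-magnitude elements of a flattened gradient."""
    _, idx = torch.topk(flat_grad.abs(), k)   # indices of top-k by magnitude
    values = flat_grad[idx]
    return values, idx


class TopKSGDWorker:
    """Per-worker state: the accumulated residual used for error feedback."""

    def __init__(self, num_params: int, k: int):
        self.residual = torch.zeros(num_params)  # gradient mass not yet sent
        self.k = k

    def step(self, grad: torch.Tensor):
        # 1) Add back the error accumulated from previous iterations.
        corrected = grad.flatten() + self.residual
        # 2) Sparsify: keep only the k largest-magnitude coordinates.
        values, idx = topk_compress(corrected, self.k)
        # 3) Store everything that was NOT sent as the new residual.
        sparse = torch.zeros_like(corrected)
        sparse[idx] = values
        self.residual = corrected - sparse
        # 4) In distributed training, (values, idx) would be exchanged
        #    (e.g., all-gathered) among workers and applied to the model.
        return values, idx
```

In this sketch the residual retains the coordinates dropped by the Top-$k$ operator, so they are eventually transmitted in later iterations; this is the error-compensation mechanism whose convergence the paper analyzes.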
