论文信息 - An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems

An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems

The recent many-fold increase in the size of deep neural networks makes efficient distributed training challenging. Many proposals exploit the compressibility of the gradients and propose lossy compression techniques to speed up the communication stage of distributed training. Nevertheless, compression comes at the cost of reduced model quality and extra computation overhead. In this work, we design an efficient compressor with minimal overhead. Noting the sparsity of the gradients, we propose to model the gradients as random variables distributed according to some sparsity-inducing distributions (SIDs). We empirically validate our assumption by studying the statistical characteristics of the evolution of gradient vectors over the training process. We then propose Sparsity-Inducing Distribution-based Compression (SIDCo) , a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC) while being faster by imposing lower compression overhead. Our extensive evaluation of popular machine learning benchmarks involving both recurrent neural network (RNN) and convolution neural network (CNN) models shows that SIDCo speeds up training by up to ≈41.7×, 7.6×, and 1.9× compared to the no-compression baseline, Topk, and DGC compressors, respectively.

[1] Nikhil R. Devanur,et al. PipeDream: generalized pipeline parallelism for DNN training , 2019, SOSP.

[2] Brahim Bensaou,et al. SICC: SDN-based incast congestion control for data centers , 2017, 2017 IEEE International Conference on Communications (ICC).

[3] Marc'Aurelio Ranzato,et al. Large Scale Distributed Deep Networks , 2012, NIPS.

[4] R. DeVore,et al. Nonlinear approximation , 1998, Acta Numerica.

[5] Marco Righero,et al. An introduction to compressive sensing , 2009 .

[6] Zhi Ding,et al. Federated Learning Based on Over-the-Air Computation , 2019, ICC 2019 - 2019 IEEE International Conference on Communications (ICC).

[7] Ji Liu,et al. Gradient Sparsification for Communication-Efficient Distributed Optimization , 2017, NeurIPS.

[8] B. Bensaou,et al. Enforcing Transport-Agnostic Congestion Control via SDN in Data Centers , 2017 .

[9] Mark Chen,et al. Language Models are Few-Shot Learners , 2020, NeurIPS.

[10] Wei Zhang,et al. AdaComp : Adaptive Residual Gradient Compression for Data-Parallel Distributed Training , 2017, AAAI.

[11] Sam Ade Jacobs,et al. Communication Quantization for Data-Parallel Training of Deep Neural Networks , 2016, 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC).

[12] Alexander Sergeev,et al. Horovod: fast and easy distributed deep learning in TensorFlow , 2018, ArXiv.

[13] Kenneth Heafield,et al. Sparse Communication for Distributed Gradient Descent , 2017, EMNLP.

[14] Rishikesh R. Gajjala,et al. Huffman Coding Based Encoding Techniques for Fast Distributed Deep Learning , 2020, DistributedML@CoNEXT.

[15] Marco Canini,et al. DC2: Delay-aware Compression Control for Distributed Machine Learning , 2021, IEEE INFOCOM 2021 - IEEE Conference on Computer Communications.

[16] Athanasios Papoulis,et al. Probability, Random Variables and Stochastic Processes , 1965 .

[17] Hao Wu,et al. Mixed Precision Training , 2017, ICLR.

[18] Shaohuai Shi,et al. Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs , 2017, 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress(DASC/PiCom/DataCom/CyberSciTech).

[19] Thomas Hofmann,et al. Communication-Efficient Distributed Dual Coordinate Ascent , 2014, NIPS.

[20] Erich Elsen,et al. Deep Speech: Scaling up end-to-end speech recognition , 2014, ArXiv.

[21] Vincent Vanhoucke,et al. Improving the speed of neural networks on CPUs , 2011 .

[22] Martin Jaggi,et al. PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization , 2019, NeurIPS.

[23] Shaun A. Bond. A review of asymmetric conditional density functions in autoregressive conditional heteroscedasticity models , 2001 .

[24] Prabhat,et al. Exascale Deep Learning for Climate Analytics , 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.

[25] Chen Ningjiang,et al. An Implementation of GPU-Based Parallel Optimization for an Extended Uncertain Data Query Algorithm , 2011, 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming.

[26] Kaiming He,et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.

[27] Thorsten Gerber,et al. Handbook Of Mathematical Functions , 2016 .

[28] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[29] R. E. Wheeler. Statistical distributions , 1983, APLQ.

[30] BottaAlessio,et al. Measuring network throughput in the cloud , 2015 .

[31] Ahmed M. Abdelmoniem,et al. On the Discrepancy between the Theoretical Analysis and Practical Implementations of Compressed Communication for Distributed Deep Learning , 2019, AAAI.

[32] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.

[33] Aggelos K. Katsaggelos,et al. Bayesian Compressive Sensing Using Laplace Priors , 2010, IEEE Transactions on Image Processing.

[34] Malcolm R Leadbetter,et al. On a basis for 'Peaks over Threshold' modeling , 1991 .

[35] Mohammad Shoeybi,et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism , 2019, ArXiv.

[36] M. Abramowitz,et al. Handbook of Mathematical Functions, with Formulas, Graphs, and Mathematical Tables , 1966 .

[37] Junzhou Huang,et al. Error Compensated Quantized SGD and its Applications to Large-scale Distributed Optimization , 2018, ICML.

[38] J. Hosking,et al. Parameter and quantile estimation for the generalized pareto distribution , 1987 .

[39] Martin Jaggi,et al. Error Feedback Fixes SignSGD and other Gradient Compression Schemes , 2019, ICML.

[40] Dan Alistarh,et al. The Convergence of Sparsified Gradient Methods , 2018, NeurIPS.

[41] Jersey Chen,et al. Statistical Distributions, Second Edition , 1998, The Yale Journal of Biology and Medicine.

[42] John G. Proakis,et al. Probability, random variables and stochastic processes , 1985, IEEE Trans. Acoust. Speech Signal Process..

[43] Marco Chiani,et al. Lossy Compression of Noisy Sparse Sources Based on Syndrome Encoding , 2019, IEEE Transactions on Communications.

[44] Jaeyong Lee,et al. GENERALIZED DOUBLE PARETO SHRINKAGE. , 2011, Statistica Sinica.

[45] Ahmed M. Abdelmoniem,et al. Compressed Communication for Distributed Deep Learning: Survey and Quantitative Evaluation , 2020 .

[46] Andrea Giorgetti,et al. Syndrome-Based Encoding of Compressible Sources for M2M Communication , 2017, GLOBECOM 2017 - 2017 IEEE Global Communications Conference.

[47] S. Nadarajah,et al. Extreme Value Distributions: Theory and Applications , 2000 .

[48] Brahim Bensaou,et al. Taming Latency in Data Centers Via Active Congestion-Probing , 2019, 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS).

[49] Tong Yang,et al. SketchML: Accelerating Distributed Machine Learning with Data Sketches , 2018, SIGMOD Conference.

[50] Pradeep Dubey,et al. Mixed Precision Training of Convolutional Neural Networks using Integer Operations , 2018, ICLR.

[51] Yongjian Wu,et al. Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters , 2020, MLSys.

[52] Mingyi Hong,et al. Distributed Learning in the Nonconvex World: From batch data to streaming and beyond , 2020, IEEE Signal Processing Magazine.

[53] Alex Krizhevsky,et al. Learning Multiple Layers of Features from Tiny Images , 2009 .

[54] Anshumali Shrivastava,et al. SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems , 2019, MLSys.

[55] Richard L. Smith. Threshold Methods for Sample Extremes , 1984 .

[56] Martin Jaggi,et al. Sparsified SGD with Memory , 2018, NeurIPS.

[57] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[58] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59] Nikko Strom,et al. Scalable distributed DNN training using commodity GPU cloud computing , 2015, INTERSPEECH.

[60] F. Olver. Asymptotics and Special Functions , 1974 .

[61] Andrea Giorgetti,et al. Weak RIC Analysis of Finite Gaussian Matrices for Joint Sparse Recovery , 2017, IEEE Signal Processing Letters.

[62] Brahim Bensaou,et al. Curbing Timeouts for TCP-Incast in Data Centers via A Cross-Layer Faster Recovery Mechanism , 2018, IEEE INFOCOM 2018 - IEEE Conference on Computer Communications.

[63] Yuan Yu,et al. TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[64] Samuel Madden,et al. Efficient Top-K Query Processing on Massively Parallel Hardware , 2018, SIGMOD Conference.

[65] Jiawei Jiang,et al. Don't Waste Your Bits! Squeeze Activations and Gradients for Deep Neural Networks via TinyScript , 2020, ICML.

[66] Yoshua Bengio,et al. Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[67] Brahim Bensaou,et al. HyGenICC: Hypervisor-based generic IP congestion control for virtualized data centers , 2016, 2016 IEEE International Conference on Communications (ICC).

[68] Tim Verbelen,et al. A Survey on Distributed Machine Learning , 2019, ACM Comput. Surv..

[69] Eric P. Smith,et al. An Introduction to Statistical Modeling of Extreme Values , 2002, Technometrics.

[70] Peter Richtárik,et al. Federated Learning: Strategies for Improving Communication Efficiency , 2016, ArXiv.

[71] Peng Jiang,et al. A Linear Speedup Analysis of Distributed Deep Learning with Sparse and Quantized Communication , 2018, NeurIPS.

[72] Cong Xu,et al. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning , 2017, NIPS.

[73] Stéphane Mallat,et al. A Wavelet Tour of Signal Processing - The Sparse Way, 3rd Edition , 2008 .

[74] William J. Dally,et al. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training , 2017, ICLR.

[75] Yibo Zhu,et al. A generic communication scheduler for distributed DNN training acceleration , 2019, SOSP.

[76] Andrea Giorgetti,et al. Limits on Sparse Data Acquisition: RIC Analysis of Finite Gaussian Matrices , 2018, IEEE Transactions on Information Theory.

[77] Aymeric Dieuleveut,et al. Communication trade-offs for Local-SGD with large step size , 2019, NeurIPS.

[78] Vishal Monga,et al. Sparsity Constrained Estimation in Image Processing and Computer Vision , 2017, Handbook of Convex Optimization Methods in Imaging Science.

[79] Brahim Bensaou,et al. Hysteresis-based Active Queue Management for TCP Traffic in Data Centers , 2019, IEEE INFOCOM 2019 - IEEE Conference on Computer Communications.

[80] Zheng Zhang,et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems , 2015, ArXiv.

[81] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[82] Shaohuai Shi,et al. Understanding Top-k Sparsification in Distributed Deep Learning , 2019, ArXiv.