Accordion: Adaptive Gradient Communication via Critical Learning Regime Identification

Distributed model training suffers from communication bottlenecks due to frequent model updates transmitted across compute nodes. To alleviate these bottlenecks, practitioners use gradient compression techniques such as sparsification, quantization, or low-rank updates. These techniques usually require choosing a static compression ratio, forcing users to balance model accuracy against per-iteration speedup. In this work, we show that the performance degradation caused by choosing a high compression ratio is not fundamental: an adaptive compression strategy can reduce communication while maintaining final test accuracy. Inspired by recent findings on critical learning regimes, in which small gradient errors can have an irrecoverable impact on model performance, we propose Accordion, a simple yet effective adaptive compression algorithm. While Accordion maintains a high average compression rate, it avoids over-compressing gradients during critical learning regimes, which it detects with a simple gradient-norm-based criterion. Our extensive experimental study across a number of machine learning tasks in distributed environments shows that Accordion maintains model accuracy comparable to uncompressed training while achieving up to 5.5x better compression and up to 4.1x end-to-end speedup over static approaches. We also show that Accordion can adjust the batch size, another popular strategy for alleviating communication bottlenecks.
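To make the switching idea concrete, the sketch below shows one way a gradient-norm-based regime detector could be wired into a data-parallel training loop. The class name `CompressionSwitcher`, the relative-change threshold `eta`, and the two ratios `k_low` / `k_high` are illustrative assumptions, not the paper's exact interface; this is a minimal sketch of the criterion, not the authors' implementation.

```python
# Minimal sketch of a gradient-norm-based switching rule in the spirit of
# Accordion. All names below (CompressionSwitcher, eta, k_low, k_high) are
# illustrative assumptions, not the paper's exact API.
import torch


class CompressionSwitcher:
    """Pick a compression ratio based on how fast the gradient norm is changing."""

    def __init__(self, k_low: int, k_high: int, eta: float = 0.5):
        self.k_low = k_low      # aggressive compression (few components kept)
        self.k_high = k_high    # gentle compression (many components kept)
        self.eta = eta          # relative-change threshold for "critical regime"
        self.prev_norm = None   # gradient norm observed at the previous check

    def select_ratio(self, grads):
        """grads: iterable of gradient tensors from the current step/epoch."""
        norm = torch.sqrt(
            sum(g.detach().float().pow(2).sum() for g in grads)
        ).item()
        if self.prev_norm is None:
            self.prev_norm = norm
            return self.k_high  # no history yet: compress conservatively
        rel_change = abs(self.prev_norm - norm) / (self.prev_norm + 1e-12)
        self.prev_norm = norm
        # A rapidly changing gradient norm signals a critical regime,
        # so compress gently; otherwise compress aggressively.
        return self.k_high if rel_change >= self.eta else self.k_low
```

In practice, `select_ratio` would be called once per epoch (or once every few iterations) on the aggregated gradients, and the returned value passed to whatever compressor is in use, e.g. as the rank for a low-rank method or the sparsity budget for a Top-k scheme.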
