CADA: Communication-Adaptive Distributed Adam

Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning, and it is often used through adaptive variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive stochastic gradient method for distributed machine learning that can be viewed as the communication-adaptive counterpart of the celebrated Adam method, which justifies the name CADA. The key components of CADA are a set of new rules tailored to adaptive stochastic gradients that can be implemented to reduce communication uploads. The new algorithms adaptively reuse stale Adam gradients, thereby saving communication, while retaining convergence rates comparable to those of the original Adam. In numerical experiments, CADA achieves impressive empirical performance in terms of total communication-round reduction.
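
To make the gradient-reuse idea concrete, the following sketch pairs a server-side Adam update with a simple lazy-upload rule: a worker skips its upload whenever its fresh stochastic gradient differs little from the copy the server already holds, and the server reuses that stale gradient in a standard Adam step. The reuse threshold, the skipping criterion, and the toy least-squares setup are illustrative assumptions, not the paper's exact CADA rules.

```python
# Minimal sketch of a communication-adaptive Adam step in the spirit of CADA.
# The skipping rule and threshold are simplified placeholders.
import numpy as np

rng = np.random.default_rng(0)

# Toy distributed least-squares problem: M workers, each with local data.
M, d, n = 4, 10, 50
A = [rng.normal(size=(n, d)) for _ in range(M)]
b = [rng.normal(size=n) for _ in range(M)]

def local_grad(m, x):
    """Stochastic gradient of worker m's local least-squares loss (mini-batch of 10)."""
    idx = rng.choice(n, size=10, replace=False)
    Am, bm = A[m][idx], b[m][idx]
    return Am.T @ (Am @ x - bm) / len(idx)

# Adam state and standard hyperparameters.
x = np.zeros(d)
m1, m2 = np.zeros(d), np.zeros(d)
beta1, beta2, eps, lr = 0.9, 0.999, 1e-8, 0.01

# Server-side copies of each worker's last uploaded gradient.
stale = [np.zeros(d) for _ in range(M)]
threshold = 1e-3   # illustrative reuse threshold
uploads = 0

for t in range(1, 201):
    agg = np.zeros(d)
    for w in range(M):
        g = local_grad(w, x)
        # Communication-adaptive rule (simplified): upload only if the new
        # gradient differs enough from the stale copy held by the server.
        if np.linalg.norm(g - stale[w]) ** 2 > threshold:
            stale[w] = g
            uploads += 1
        agg += stale[w]          # otherwise the server reuses the stale gradient
    agg /= M

    # Standard Adam update with the (possibly stale) aggregated gradient.
    m1 = beta1 * m1 + (1 - beta1) * agg
    m2 = beta2 * m2 + (1 - beta2) * agg ** 2
    m1_hat = m1 / (1 - beta1 ** t)
    m2_hat = m2 / (1 - beta2 ** t)
    x -= lr * m1_hat / (np.sqrt(m2_hat) + eps)

print(f"uploads used: {uploads} of {200 * M} possible")
```

In this sketch the saving comes entirely from skipped uploads: the server still performs an Adam step every round, but with fewer fresh gradients than a fully synchronous implementation would require.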
