Decentralized Learning With Lazy and Approximate Dual Gradients

This paper develops algorithms for decentralized machine learning over a network, where data are distributed, computation is localized, and communication is restricted to neighboring nodes. A line of recent research in this area focuses on improving both computation and communication complexities. The methods SSDA and MSDA \cite{scaman2017optimal} have optimal communication complexity when the objective is smooth and strongly convex, and are simple to derive. However, they require solving a subproblem at each step. We propose new algorithms that save computation by using (stochastic) gradients and save communication when previous information is sufficiently useful. Our methods remain relatively simple -- rather than solving a subproblem, they run Katyusha for a small, fixed number of steps from the latest point. An easy-to-compute, local rule decides whether a worker can skip a round of communication. Moreover, our methods provably reduce the communication and computation complexities of SSDA and MSDA. In numerical experiments, our algorithms achieve significant reductions in both computation and communication compared with the state of the art.
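
To make the two ingredients of the abstract concrete -- a small, fixed number of inner steps instead of an exact subproblem solve, and a locally checkable rule for skipping communication -- here is a minimal, hypothetical sketch. It is not the paper's algorithm: it uses synthetic local least-squares losses on a ring graph, a few plain gradient steps as a stand-in for the Katyusha inner loop, and a simple norm-threshold test (`skip_tol`) as the lazy-communication rule. All names and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, samples = 5, 10, 20

# Synthetic local least-squares problems: f_i(x) = 0.5 * ||A_i x - b_i||^2.
A = [rng.standard_normal((samples, dim)) for _ in range(n_workers)]
b = [rng.standard_normal(samples) for _ in range(n_workers)]

def local_grad(i, x):
    """Gradient of worker i's local loss f_i."""
    return A[i].T @ (A[i] @ x - b[i])

# Ring topology with a symmetric, doubly stochastic mixing matrix W.
W = np.zeros((n_workers, n_workers))
for i in range(n_workers):
    W[i, i] = 0.5
    W[i, (i - 1) % n_workers] = 0.25
    W[i, (i + 1) % n_workers] = 0.25

x = np.zeros((n_workers, dim))          # local iterates
last_sent = np.zeros((n_workers, dim))  # messages neighbors currently hold
skip_tol = 1e-3                         # threshold of the lazy (skipping) rule
inner_steps, inner_lr, mix_weight = 5, 1e-2, 0.5
transmissions = 0

for k in range(200):
    # Approximate local computation: a small, fixed number of gradient steps
    # per worker (a cheap stand-in for running Katyusha from the latest point).
    for i in range(n_workers):
        for _ in range(inner_steps):
            x[i] -= inner_lr * local_grad(i, x[i])

    # Lazy communication: worker i transmits only if its iterate has moved
    # enough since its last transmission -- an easy-to-compute, local test.
    for i in range(n_workers):
        if np.linalg.norm(x[i] - last_sent[i]) > skip_tol:
            last_sent[i] = x[i].copy()
            transmissions += 1

    # Each worker mixes the most recent messages it holds from its neighbors.
    x = (1 - mix_weight) * x + mix_weight * (W @ last_sent)

print("transmissions used:", transmissions, "of", 200 * n_workers)
print("consensus gap:", np.linalg.norm(x - x.mean(axis=0)))
```

The point of the sketch is only the control flow: workers whose iterates have barely changed keep silent and their neighbors reuse the stale message, so communication is spent where it is informative.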

[1] Dan Alistarh, et al. The Convergence of Sparsified Gradient Methods, 2018, NeurIPS.

[2] Laurent Massoulié, et al. An Accelerated Decentralized Stochastic Proximal Algorithm for Finite Sums, 2019, NeurIPS.

[3] Ali H. Sayed, et al. Exact Diffusion for Distributed Optimization and Learning—Part I: Algorithm Development, 2017, IEEE Transactions on Signal Processing.

[4] Georgios B. Giannakis, et al. LAG: Lazily Aggregated Gradient for Communication-Efficient Distributed Learning, 2018, NeurIPS.

[5] Hadrien Hendrikx, et al. Accelerated Decentralized Optimization with Local Updates for Smooth and Strongly Convex Objectives, 2018, AISTATS.

[6] Martin Jaggi, et al. Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication, 2019, ICML.

[7] Laurent Massoulié, et al. Optimal Algorithms for Smooth and Strongly Convex Distributed Optimization in Networks, 2017, ICML.

[8] Dmitry Kovalev, et al. Optimal and Practical Algorithms for Smooth and Strongly Convex Decentralized Optimization, 2020, NeurIPS.

[9] Nikko Strom, et al. Scalable distributed DNN training using commodity GPU cloud computing, 2015, INTERSPEECH.

[10] Wei Shi, et al. Achieving Geometric Convergence for Distributed Optimization Over Time-Varying Graphs, 2016, SIAM J. Optim.

[11] Tao Lin, et al. Don't Use Large Mini-Batches, Use Local SGD, 2018, ICLR.

[12] Haoran Sun, et al. Improving the Sample and Communication Complexity for Decentralized Non-Convex Optimization: A Joint Gradient Estimation and Tracking Approach, 2019, arXiv.

[13] Tong Zhang, et al. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, 2013, NIPS.

[14] Michael I. Jordan, et al. A Lyapunov Analysis of Momentum Methods in Optimization, 2016, arXiv.

[15] Dan Alistarh, et al. QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks, 2016, arXiv:1610.02132.

[16] Haoran Sun, et al. Improving the Sample and Communication Complexity for Decentralized Non-Convex Optimization: Joint Gradient Estimation and Tracking, 2020, ICML.

[17] Qing Ling, et al. EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization, 2014, arXiv:1404.6264.

[18] Zeyuan Allen-Zhu, et al. Katyusha: The First Direct Acceleration of Stochastic Gradient Methods, 2016, J. Mach. Learn. Res.

[19] Angelia Nedic, et al. A Dual Approach for Optimal Algorithms in Distributed Optimization over Networks, 2018, 2020 Information Theory and Applications Workshop (ITA).

[20] Sebastian U. Stich, et al. Local SGD Converges Fast and Communicates Little, 2018, ICLR.

[21] Wotao Yin, et al. Decentralized Accelerated Gradient Methods With Increasing Penalty Parameters, 2018, IEEE Transactions on Signal Processing.

[22] Qing Ling, et al. On the Linear Convergence of the ADMM in Decentralized Consensus Optimization, 2013, IEEE Transactions on Signal Processing.

[23] Aryan Mokhtari, et al. DSA: Decentralized Double Stochastic Averaging Gradient Algorithm, 2015, J. Mach. Learn. Res.

[24] Louis A. Hageman, et al. Iterative Solution of Large Linear Systems, 1971.

[25] João M. F. Xavier, et al. D-ADMM: A Communication-Efficient Distributed Algorithm for Separable Optimization, 2012, IEEE Transactions on Signal Processing.

[26] Dong Yu, et al. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs, 2014, INTERSPEECH.

[27] Martin Jaggi, et al. COLA: Decentralized Linear Learning, 2018, NeurIPS.

[28] Kamyar Azizzadenesheli, et al. signSGD: Compressed Optimisation for Non-Convex Problems, 2018, ICML.

[29] Shenghuo Zhu, et al. Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning, 2018, AAAI.

[30] Mingyi Hong, et al. Distributed Non-Convex First-Order Optimization and Information Processing: Lower Complexity Bounds and Rate Optimal Algorithms, 2018, 52nd Asilomar Conference on Signals, Systems, and Computers.

[31] Shenghuo Zhu, et al. Parallel Restarted SGD for Non-Convex Optimization with Faster Convergence and Less Communication, 2018, arXiv.

[32] Aryan Mokhtari, et al. Towards More Efficient Stochastic Decentralized Learning: Faster Convergence and Sparse Communication, 2018, ICML.

[33] Jianyu Wang, et al. Cooperative SGD: A Unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms, 2018, arXiv.

[34] Martin Jaggi, et al. Sparsified SGD with Memory, 2018, NeurIPS.

[35] Ali Sayed, et al. Adaptation, Learning, and Optimization over Networks, 2014, Found. Trends Mach. Learn.

[36] Ji Liu, et al. Gradient Sparsification for Communication-Efficient Distributed Optimization, 2017, NeurIPS.

[37] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, 2014, Applied Optimization.