Robust Distributed Accelerated Stochastic Gradient Methods for Multi-Agent Networks

We study the distributed stochastic gradient (D-SG) method and its accelerated variant (D-ASG) for solving decentralized strongly convex stochastic optimization problems in which the objective function is distributed over several computational units that lie on a fixed but arbitrary connected communication graph, are subject to local communication constraints, and have access only to noisy estimates of the gradients. We develop a framework that allows choosing the stepsize and momentum parameters of these algorithms to optimize performance by systematically trading off the bias, the variance, robustness to gradient noise, and dependence on network effects. When gradients are noiseless, we also prove that distributed accelerated methods can \emph{achieve acceleration}, requiring $\mathcal{O}(\sqrt{\kappa} \log(1/\varepsilon))$ gradient evaluations and $\mathcal{O}(\sqrt{\kappa} \log(1/\varepsilon))$ communications to converge to the same fixed point as the non-accelerated variant, where $\kappa$ is the condition number and $\varepsilon$ is the target accuracy. To our knowledge, this is the first acceleration result in which the iteration complexity scales with the square root of the condition number in the context of \emph{primal} distributed inexact first-order methods. For quadratic functions, we also provide finer performance bounds that are tight with respect to the bias and variance terms. Finally, we study a multistage version of D-ASG whose parameters are carefully varied over stages to ensure an exact $\mathcal{O}(\exp(-k/\sqrt{\kappa}))$ linear decay in the bias term as well as the optimal $\mathcal{O}(\sigma^2/k)$ rate in the variance term. We illustrate through numerical experiments that our approach yields practical algorithms that are robust to gradient noise and can outperform existing methods.
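
To make the setting concrete, below is a minimal sketch, assuming a D-ASG-style update of the form $x_i^{k+1} = \sum_j W_{ij} y_j^k - \alpha \tilde{\nabla} f_i(y_i^k)$, $y_i^{k+1} = (1+\beta) x_i^{k+1} - \beta x_i^k$, run on a toy decentralized quadratic problem with noisy gradients. The ring graph, the noise level, and the values of $\alpha$ and $\beta$ are illustrative placeholders, not the tuned choices produced by the paper's framework.

```python
import numpy as np

# Minimal sketch of a D-ASG-style iteration on a toy decentralized problem:
# n agents on a ring, each with a local strongly convex quadratic
#   f_i(x) = 0.5 * x^T A_i x - b_i^T x,
# mixing their iterates through a doubly stochastic matrix W while using
# noisy local gradients.  Stepsize, momentum, noise level, and graph are
# illustrative placeholders, not the tuned values from the paper's analysis.

rng = np.random.default_rng(0)
n_agents, dim = 8, 5

# Local quadratics (strongly convex by construction).
A = [np.eye(dim) + 0.5 * np.diag(rng.random(dim)) for _ in range(n_agents)]
b = [rng.standard_normal(dim) for _ in range(n_agents)]

# Doubly stochastic mixing matrix for a ring graph (lazy Metropolis weights).
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    W[i, i] = 0.5
    W[i, (i - 1) % n_agents] = 0.25
    W[i, (i + 1) % n_agents] = 0.25

def noisy_grad(i, x, sigma=0.1):
    """Stochastic gradient of f_i at x with additive Gaussian noise."""
    return A[i] @ x - b[i] + sigma * rng.standard_normal(dim)

alpha, beta = 0.1, 0.5            # illustrative stepsize and momentum
x = np.zeros((n_agents, dim))     # agent iterates x_i^k (one row per agent)
y = np.zeros_like(x)              # momentum variables y_i^k

for k in range(500):
    mixed = W @ y                              # average neighbors' y variables
    grads = np.stack([noisy_grad(i, y[i]) for i in range(n_agents)])
    x_new = mixed - alpha * grads              # gradient step on the mixed iterate
    y = (1 + beta) * x_new - beta * x          # Nesterov-style extrapolation
    x = x_new

# Consensus error: how far the agents' iterates are from their network average.
print("disagreement:", np.linalg.norm(x - x.mean(axis=0)))
```

Setting beta to zero in this sketch recovers a plain D-SG iteration, which is the baseline against which the bias, variance, and robustness trade-offs described above are measured.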
