A Sharp Estimate on the Transient Time of Distributed Stochastic Gradient Descent

This paper is concerned with minimizing the average of $n$ cost functions over a network in which agents may communicate and exchange information with each other. We consider the setting where only noisy gradient information is available. To solve the problem, we study the distributed stochastic gradient descent (DSGD) method and perform a non-asymptotic convergence analysis. For strongly convex and smooth objective functions, we show that DSGD asymptotically achieves the optimal network-independent convergence rate of centralized stochastic gradient descent (SGD). Our main contribution is to characterize the transient time needed for DSGD to approach this asymptotic rate, which we show behaves as $K_T=\mathcal{O}\left(\frac{n}{(1-\rho_w)^2}\right)$, where $1-\rho_w$ denotes the spectral gap of the mixing matrix. Moreover, we construct a "hard" optimization problem for which the transient time of DSGD is lower bounded by $\Omega\left(\frac{n}{(1-\rho_w)^2}\right)$, establishing the sharpness of the upper bound. Numerical experiments corroborate the theoretical results.
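
The abstract does not reproduce the DSGD update itself, so the following is a minimal sketch of the standard DSGD iteration (each agent mixes its iterate with its neighbors' via the mixing matrix $W$, then takes a noisy gradient step with a decaying stepsize). The ring topology, lazy Metropolis weights, quadratic local costs, noise level, and $\mathcal{O}(1/k)$ stepsize here are illustrative assumptions, not the paper's specific experimental setup.

```python
import numpy as np

# Minimal DSGD sketch: n agents on a ring, each holding a scalar cost
# f_i(x) = 0.5 * (x - b_i)^2, so the global minimizer is mean(b).
# Gradients are corrupted by additive Gaussian noise with std sigma.
rng = np.random.default_rng(0)
n, T, sigma = 20, 5000, 1.0
b = rng.normal(size=n)                  # per-agent optima; global optimum = b.mean()

# Doubly stochastic mixing matrix for a ring (lazy Metropolis-style weights).
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

# 1 - rho_w is the spectral gap appearing in the transient-time bound,
# where rho_w is the spectral norm of W minus the averaging matrix.
rho_w = np.linalg.norm(W - np.ones((n, n)) / n, 2)

x = np.zeros(n)                         # each agent's current iterate
for k in range(1, T + 1):
    alpha = 2.0 / (k + 10)              # O(1/k) stepsize, standard for strongly convex costs
    grad = (x - b) + sigma * rng.normal(size=n)   # noisy local gradients
    x = W @ x - alpha * grad            # consensus (mixing) step, then gradient step

print("spectral gap 1 - rho_w:", 1.0 - rho_w)
print("consensus error       :", np.abs(x - x.mean()).max())
print("optimality gap        :", abs(x.mean() - b.mean()))
```

After the transient phase, the averaged iterate of such a scheme is expected to track the centralized SGD trajectory; the sketch's final prints give a rough sense of how far the network is from consensus and from the optimum at a given iteration count.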
