An Improved Convergence Analysis for Decentralized Online Stochastic Non-Convex Optimization

In this paper, we study decentralized online stochastic non-convex optimization over a network of nodes. Integrating a technique called gradient tracking into decentralized stochastic gradient descent, we show that the resulting algorithm, GT-DSGD, enjoys certain desirable characteristics for minimizing a sum of smooth non-convex functions. In particular, for general smooth non-convex functions, we establish non-asymptotic characterizations of GT-DSGD and derive the conditions under which it achieves network-independent performance that matches centralized minibatch SGD. In contrast, the existing results suggest that GT-DSGD is always network-dependent and is therefore strictly worse than centralized minibatch SGD. When the global non-convex function additionally satisfies the Polyak-Łojasiewicz (PL) condition, we establish the linear convergence of GT-DSGD up to a steady-state error under appropriate constant step-sizes. Moreover, under stochastic approximation step-sizes, we establish, for the first time, the optimal global sublinear convergence rate on almost every sample path, in addition to the asymptotically optimal sublinear rate in expectation. Since strongly convex functions are a special case of functions satisfying the PL condition, our results are not only immediately applicable but also improve the currently known best convergence rates and their dependence on problem parameters.
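For context, here is a minimal sketch of the per-node gradient-tracking recursion that underlies GT-DSGD, written in LaTeX; the notation (mixing weights $w_{ij}$, step-size $\alpha$, node count $n$, and stochastic gradients $g_i(\cdot;\xi_i^k)$) is an assumption for illustration rather than a quotation from the paper:

\[
x_i^{k+1} = \sum_{j=1}^{n} w_{ij}\bigl(x_j^{k} - \alpha\, y_j^{k}\bigr),
\qquad
y_i^{k+1} = \sum_{j=1}^{n} w_{ij}\, y_j^{k} + g_i\bigl(x_i^{k+1};\xi_i^{k+1}\bigr) - g_i\bigl(x_i^{k};\xi_i^{k}\bigr),
\]

with the initialization $y_i^{0} = g_i(x_i^{0};\xi_i^{0})$. The auxiliary variable $y_i^{k}$ tracks the network-average stochastic gradient, which is the mechanism behind the network-independent performance discussed in the abstract.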
