A fast randomized incremental gradient method for decentralized non-convex optimization

We study decentralized non-convex finite-sum minimization problems defined over a network of nodes, where each node possesses a local batch of data samples. We propose a single-timescale first-order randomized incremental gradient method, termed GT-SAGA. GT-SAGA is computationally efficient, since it evaluates only one component gradient per node per iteration, and achieves provably fast and robust performance by leveraging node-level variance reduction and network-level gradient tracking. For general smooth non-convex problems, we show almost sure and mean-squared convergence to a first-order stationary point, and we describe regimes of practical significance in which GT-SAGA achieves a network-independent convergence rate and outperforms existing approaches. When the global cost function further satisfies the Polyak-Łojasiewicz condition, we show that GT-SAGA exhibits global linear convergence to an optimal solution in expectation, and we describe regimes of practical interest in which the performance is network-independent and improves upon existing work. Numerical experiments based on real-world datasets are included to illustrate the behavior and convergence properties of the proposed method.
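The abstract describes the method only at a high level; the sketch below is a minimal, illustrative rendering of a GT-SAGA-style iteration, intended to make the two ingredients concrete: a node-level SAGA gradient estimator (one fresh component gradient per node per iteration, combined with a gradient table) and network-level gradient tracking through a mixing matrix. All names (gt_saga, grad, W, m, x0, alpha) are assumptions introduced for illustration, not the authors' reference implementation, and the exact update order in the paper may differ.

# A minimal, illustrative sketch of a GT-SAGA-style update (not the authors'
# reference implementation). Assumptions: an undirected network with a doubly
# stochastic mixing matrix W, each node i holding m[i] component gradients
# accessible through grad(i, s, x), and a single constant step size alpha.
import numpy as np

def gt_saga(grad, W, m, x0, alpha, num_iters, rng=None):
    """One possible GT-SAGA-style loop for n nodes with iterates in R^p.

    grad(i, s, x): gradient of the s-th local component of node i at x.
    W: (n, n) doubly stochastic mixing matrix of the network.
    m: sequence of local sample counts m[i], one per node.
    x0: (n, p) initial iterates, one row per node.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, p = x0.shape
    x = x0.copy()

    # SAGA gradient tables: table[i][s] stores the most recent gradient of
    # the s-th component of node i.
    table = [np.stack([grad(i, s, x[i]) for s in range(m[i])]) for i in range(n)]
    g = np.stack([table[i].mean(axis=0) for i in range(n)])   # local SAGA estimators
    y = g.copy()                                              # gradient trackers

    for _ in range(num_iters):
        # Network-level step: mix with neighbors and descend along the tracker.
        x = W @ (x - alpha * y)

        g_new = np.empty_like(g)
        for i in range(n):
            s = rng.integers(m[i])               # one random component per node
            fresh = grad(i, s, x[i])
            # Node-level SAGA estimator: fresh gradient minus its stale copy,
            # plus the running average of the table.
            g_new[i] = fresh - table[i][s] + table[i].mean(axis=0)
            table[i][s] = fresh                  # refresh the table entry
        # Gradient tracking: y asymptotically tracks the network average of the g_i.
        y = W @ y + g_new - g
        g = g_new
    return x

The single component-gradient evaluation per node per iteration in the inner loop corresponds to the computational efficiency claimed in the abstract, while mixing both x and y with W is what couples the node-level variance reduction to the network-level gradient tracking.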
