A Unified Analysis of Stochastic Gradient Methods for Nonconvex Federated Optimization

In this paper, we study the performance of a large family of SGD variants in the smooth nonconvex regime. To this end, we propose a generic and flexible assumption capable of accurately modeling the second moment of the stochastic gradient. Our assumption is satisfied by a large number of specific SGD variants in the literature, including SGD with arbitrary sampling, SGD with compressed gradients, and a wide variety of variance-reduced SGD methods such as SVRG and SAGA. We provide a single convergence analysis for all methods satisfying the proposed unified assumption, thereby offering a unified understanding of SGD variants in the nonconvex regime rather than relying on a dedicated analysis for each variant. Moreover, our unified analysis is tight enough to recover or improve upon the best-known convergence results for several classical methods, and it also yields new convergence results for many methods that arise as special cases. In the more general distributed/federated nonconvex optimization setting, we propose two new general algorithmic frameworks, which differ in whether gradients are compressed directly (DC) or gradient differences are compressed (DIANA). We show that all methods captured by these two frameworks also satisfy our unified assumption, so our unified convergence analysis also covers a large variety of distributed methods with compressed communication. Finally, we provide a unified analysis establishing faster, linear convergence rates in this nonconvex regime under the Polyak-Łojasiewicz (PL) condition.
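The abstract does not spell out the unified assumption itself. As a rough illustration only, a simplified ABC-type bound on the second moment of a gradient estimator g, together with the PL condition used for the linear-rate results, can be written as below; the constants A, B, C and the infimum value f^inf are illustrative, and the paper's full assumption may contain additional terms (e.g., a recursively controlled variance sequence), so this is a sketch rather than the exact statement.

    % Illustrative second-moment bound (simplified sketch, not the paper's exact assumption)
    \mathbb{E}\,\|g(x)\|^2 \;\le\; 2A\bigl(f(x) - f^{\inf}\bigr) + B\,\|\nabla f(x)\|^2 + C
    % PL condition, with minimum value f^* and PL constant \mu > 0
    \|\nabla f(x)\|^2 \;\ge\; 2\mu\bigl(f(x) - f^*\bigr)

Likewise, the two distributed frameworks can be sketched in a few lines. The snippet below is a minimal illustration, assuming an unbiased rand-k sparsifier as the compressor; the function names (rand_k_compress, dc_estimator, diana_estimator) and the step size alpha are hypothetical and not the paper's API. The DC-style estimator compresses each worker's stochastic gradient directly, while the DIANA-style estimator compresses the difference between the gradient and a locally maintained shift h_i, which is then updated.

    import numpy as np

    def rand_k_compress(v, k, rng):
        # Unbiased rand-k sparsification: keep k random coordinates, rescale by d/k.
        d = v.size
        out = np.zeros_like(v)
        idx = rng.choice(d, size=k, replace=False)
        out[idx] = v[idx] * (d / k)
        return out

    def dc_estimator(grads, k, rng):
        # DC-style estimator: average of directly compressed worker gradients.
        return np.mean([rand_k_compress(g, k, rng) for g in grads], axis=0)

    def diana_estimator(grads, shifts, k, alpha, rng):
        # DIANA-style estimator: compress gradient differences g_i - h_i,
        # then update each local shift as h_i <- h_i + alpha * C(g_i - h_i).
        deltas = [rand_k_compress(g - h, k, rng) for g, h in zip(grads, shifts)]
        estimator = np.mean([h + d for h, d in zip(shifts, deltas)], axis=0)
        new_shifts = [h + alpha * d for h, d in zip(shifts, deltas)]
        return estimator, new_shifts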
