Unified Analysis of Stochastic Gradient Methods for Composite Convex and Smooth Optimization

We present a unified theorem for the convergence analysis of stochastic gradient algorithms for minimizing a smooth and convex loss plus a convex regularizer. We do this by extending the unified analysis of Gorbunov, Hanzely \& Richtárik (2020) and dropping the requirement that the loss function be strongly convex; instead, we rely only on convexity of the loss. Our unified analysis applies to a host of existing algorithms, including proximal SGD, variance-reduced methods, quantization-based methods and some coordinate-descent-type methods. For the variance-reduced methods, we recover the best known convergence rates as special cases. For proximal SGD and the quantization-based and coordinate-type methods, we uncover new state-of-the-art convergence rates. Our analysis also accommodates arbitrary sampling and minibatching. As a consequence, we are able to determine the minibatch size that optimizes the total complexity of variance-reduced methods, and we showcase this by obtaining a simple formula for the optimal minibatch size of two variance-reduced methods (\textit{L-SVRG} and \textit{SAGA}). This optimal minibatch size not only improves the theoretical total complexity of these methods but also improves their convergence in practice, as we show in several experiments.
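To make the setting concrete (the notation below is introduced here for illustration and is not drawn verbatim from the abstract), the composite problem is
\[
\min_{x \in \mathbb{R}^d} F(x) := f(x) + R(x), \qquad f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x),
\]
with $f$ smooth and convex and $R$ convex, and the prototypical proximal SGD iteration with step size $\gamma > 0$ and an unbiased stochastic gradient estimator $g^k$ of $\nabla f(x^k)$ reads
\[
x^{k+1} = \operatorname{prox}_{\gamma R}\!\big(x^k - \gamma g^k\big), \qquad \operatorname{prox}_{\gamma R}(y) = \operatorname*{arg\,min}_{z \in \mathbb{R}^d} \Big\{ \tfrac{1}{2} \|z - y\|^2 + \gamma R(z) \Big\}.
\]
The variance-reduced, quantization-based and coordinate-type methods covered by the unified analysis correspond to different choices of the estimator $g^k$.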

[1] Robert M. Gower, et al. Optimal mini-batch and step sizes for SAGA, 2019, ICML.

[2] Aurélien Lucchi, et al. Variance Reduced Stochastic Gradient Descent with Neighbors, 2015, NIPS.

[3] Peter Richtárik, et al. 99% of Parallel Optimization is Inevitably a Waste of Time, 2019, arXiv.

[4] Peter Richtárik, et al. SGD and Hogwild! Convergence Without the Bounded Gradients Assumption, 2018, ICML.

[5] Francis Bach, et al. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives, 2014, NIPS.

[6] Peter Richtárik, et al. Randomized Distributed Mean Estimation: Accuracy vs. Communication, 2016, Front. Appl. Math. Stat.

[7] Deanna Needell, et al. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm, 2013, Mathematical Programming.

[8] Dan Alistarh, et al. QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks, 2016, arXiv:1610.02132.

[9] Jie Liu, et al. Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting, 2015, IEEE Journal of Selected Topics in Signal Processing.

[10] Peter Richtárik, et al. Better Theory for SGD in the Nonconvex World, 2020, Trans. Mach. Learn. Res.

[11] R. Tibshirani. Regression Shrinkage and Selection via the Lasso, 1996.

[12] Mark W. Schmidt, et al. Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron, 2018, AISTATS.

[13] Sebastian U. Stich, et al. Unified Optimal Analysis of the (Stochastic) Gradient Method, 2019, arXiv.

[14] Mark W. Schmidt, et al. Minimizing finite sums with the stochastic average gradient, 2013, Mathematical Programming.

[15] J. Lafferty, et al. High-dimensional Ising model selection using ℓ1-regularized logistic regression, 2010, arXiv:1010.0311.

[16] Tong Zhang, et al. Stochastic Optimization with Importance Sampling for Regularized Loss Minimization, 2014, ICML.

[17] Shai Ben-David, et al. Understanding Machine Learning: From Theory to Algorithms, 2014.

[18] Peter Richtárik, et al. Don't Jump Through Hoops and Remove Those Loops: SVRG and Katyusha are Better Without the Outer Loop, 2019, ALT.

[19] Jie Liu, et al. SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient, 2017, ICML.

[20] Dan Alistarh, et al. The Convergence of Sparsified Gradient Methods, 2018, NeurIPS.

[21] Alexander Shapiro, et al. Stochastic Approximation Approach to Stochastic Programming, 2013.

[22] Sebastian U. Stich, et al. Stochastic Distributed Learning with Gradient Quantization and Variance Reduction, 2019, arXiv:1904.05115.

[23] Zeyuan Allen-Zhu, et al. Variance Reduction for Faster Non-Convex Optimization, 2016, ICML.

[24] Sarit Khirirat, et al. Distributed learning with compressed gradients, 2018, arXiv:1806.06573.

[25] Peter Richtárik, et al. Distributed Learning with Compressed Gradient Differences, 2019, arXiv.

[26] Zeyuan Allen-Zhu, et al. Katyusha: the first direct acceleration of stochastic gradient methods, 2016, J. Mach. Learn. Res.

[27] Stephen J. Wright. Coordinate descent algorithms, 2015, Mathematical Programming.

[28] Yurii Nesterov, et al. Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems, 2012, SIAM J. Optim.

[29] Francis Bach, et al. Towards closing the gap between the theory and practice of SVRG, 2019, NeurIPS.

[30] Sharan Vaswani, et al. Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence, 2020, arXiv.

[31] Benjamin Grimmer, et al. Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity, 2017, SIAM J. Optim.

[32] Peter Richtárik, et al. SEGA: Variance Reduction via Gradient Sketching, 2018, NeurIPS.

[33] Ji Liu, et al. Gradient Sparsification for Communication-Efficient Distributed Optimization, 2017, NeurIPS.

[34] Tong Zhang, et al. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, 2013, NIPS.

[35] Dan Alistarh, et al. ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning, 2017, ICML.

[36] Amir Beck, et al. First-Order Methods in Optimization, 2017.

[37] Dong Yu, et al. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs, 2014, INTERSPEECH.

[38] Ke Tang, et al. Stochastic Gradient Descent for Nonconvex Learning Without Bounded Gradient Assumptions, 2019, IEEE Transactions on Neural Networks and Learning Systems.

[39] Saeed Ghadimi, et al. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming, 2013, SIAM J. Optim.

[40] H. Zou, et al. Regularization and variable selection via the elastic net, 2005.

[41] Gersende Fort, et al. On Perturbed Proximal Gradient Algorithms, 2014, J. Mach. Learn. Res.

[42] Pritish Narayanan, et al. Deep Learning with Limited Numerical Precision, 2015, ICML.

[43] H. Robbins. A Stochastic Approximation Method, 1951.

[44] Peter Richtárik, et al. A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent, 2019, AISTATS.

[45] F. Bach, et al. Stochastic quasi-gradient methods: variance reduction via Jacobian sketching, 2018, Mathematical Programming.

[46] Peter Richtárik, et al. SGD: General Analysis and Improved Rates, 2019, ICML.

[47] Chih-Jen Lin, et al. LIBSVM: A library for support vector machines, 2011, TIST.

[48] Alexander J. Smola, et al. Stochastic Variance Reduction for Nonconvex Optimization, 2016, ICML.