SEGA: Variance Reduction via Gradient Sketching

We propose SEGA (SkEtched GrAdient method), a randomized first-order optimization method which progressively builds a variance-reduced estimate of the gradient throughout its iterations from random linear measurements (sketches) of the gradient obtained from an oracle. In each iteration, SEGA updates the current gradient estimate through a sketch-and-project operation using the information provided by the latest sketch; this estimate is then turned into an unbiased estimate of the true gradient through a random relaxation procedure, and the unbiased estimate is used to perform a gradient step. Unlike standard subspace descent methods such as coordinate descent, SEGA can be applied to optimization problems with a non-separable proximal term. We provide a general convergence analysis and prove linear convergence for strongly convex objectives. In the special case of coordinate sketches, SEGA can be enhanced with techniques such as importance sampling, minibatching, and acceleration, and its rate matches the best-known rate of coordinate descent up to a small constant factor.
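
To make the description above concrete, here is a minimal sketch of the coordinate-sketch special case in Python. The abstract does not spell out the update formulas, so the specific choices below (uniform coordinate sampling, relaxation factor n, a fixed step size, and a least-squares test problem with a non-separable Euclidean-norm regularizer) are illustrative assumptions rather than the paper's exact algorithm, and the names sega_coordinate, grad, prox, and alpha are hypothetical.

```python
import numpy as np

def sega_coordinate(grad, prox, x0, n, alpha, iters, rng=None):
    """Sketch of coordinate-sketch SEGA (illustrative, not the paper's exact method).

    grad : callable returning the full gradient of f at x; only one coordinate
           of the returned vector is read per iteration, mimicking a
           partial-derivative (sketch) oracle.
    prox : callable prox(z, alpha) for the (possibly non-separable) proximal term.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    h = np.zeros(n)                      # running estimate of the gradient
    for _ in range(iters):
        i = rng.integers(n)              # uniform coordinate sketch
        gi = grad(x)[i]                  # sketched measurement: e_i^T grad f(x)
        # unbiased gradient estimate via random relaxation (factor n for uniform sampling)
        g = h.copy()
        g[i] += n * (gi - h[i])
        # sketch-and-project update of the gradient estimate (only coordinate i changes)
        h[i] = gi
        # proximal gradient step using the unbiased estimate
        x = prox(x - alpha * g, alpha)
    return x

# Example: least squares plus a non-separable Euclidean-norm regularizer.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
lam = 0.1
grad = lambda x: A.T @ (A @ x - b)

def prox(z, alpha):
    # prox of alpha*lam*||.||_2 (block soft-thresholding), which is not separable
    nz = np.linalg.norm(z)
    return np.zeros_like(z) if nz == 0 else max(0.0, 1 - alpha * lam / nz) * z

x_out = sega_coordinate(grad, prox, np.zeros(10), n=10, alpha=1e-3, iters=20000, rng=rng)
```

The point of the sketch is the last three steps of the loop: the unbiased estimate g is formed from the old estimate h before h is overwritten, and the proximal operator is applied to the whole vector, which is what lets the method handle non-separable proximal terms that plain coordinate descent cannot.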
