SVRG meets SAGA: k-SVRG - A Tale of Limited Memory

In recent years, many variance-reduced algorithms for empirical risk minimization have been introduced. In contrast to vanilla SGD, these methods converge linearly on strongly convex problems. To obtain the variance reduction, current methods either require frequent passes over the full data to recompute gradients, without making any progress during this time (as in SVRG), or they require memory of the same size as the input problem (as in SAGA). In this work, we propose k-SVRG, an algorithm that interpolates between these two extremes: it makes the best use of the available memory and thereby avoids full passes over the data during which no progress is made. We prove linear convergence of k-SVRG on strongly convex problems and convergence to stationary points on non-convex problems. Numerical experiments demonstrate the effectiveness of our method.
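To make the variance-reduction idea concrete, below is a minimal NumPy sketch of the gradient estimator shared by SVRG and SAGA on a toy least-squares problem. This is not the paper's k-SVRG pseudocode; all names (make_problem, variance_reduced_sgd, the step size and iteration count) are illustrative choices. The sketch keeps one stored gradient per example (SAGA-style, O(nd) memory); k-SVRG instead maintains only a small number of snapshot points, on the order of k, interpolating between SVRG's single snapshot and SAGA's per-example table.

```python
# Illustrative sketch of a SAGA-style variance-reduced update on least squares.
# Hypothetical example, not the authors' k-SVRG algorithm.
import numpy as np

def make_problem(n=200, d=10, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    x_true = rng.standard_normal(d)
    b = A @ x_true + 0.1 * rng.standard_normal(n)
    return A, b

def grad_i(A, b, x, i):
    # Gradient of the i-th summand f_i(x) = 0.5 * (a_i^T x - b_i)^2.
    return A[i] * (A[i] @ x - b[i])

def variance_reduced_sgd(A, b, step=1e-2, iters=5000, seed=1):
    n, d = A.shape
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    # Per-example stored gradients (SAGA-style memory of size O(n d);
    # k-SVRG limits this memory by keeping only a few snapshot points).
    memory = np.array([grad_i(A, b, x, i) for i in range(n)])
    running_avg = memory.mean(axis=0)
    for _ in range(iters):
        i = rng.integers(n)
        g_new = grad_i(A, b, x, i)
        # Variance-reduced direction: unbiased estimate of the full gradient
        # whose variance shrinks as the stored gradients become up to date.
        v = g_new - memory[i] + running_avg
        x -= step * v
        # Update the stored gradient for index i and its running average.
        running_avg += (g_new - memory[i]) / n
        memory[i] = g_new
    return x

if __name__ == "__main__":
    A, b = make_problem()
    x_hat = variance_reduced_sgd(A, b)
    print("final objective:", 0.5 * np.mean((A @ x_hat - b) ** 2))
```

The trade-off discussed in the abstract lives entirely in the `memory` table: SVRG replaces it by a single snapshot point (cheap in memory, but refreshing it costs a full pass), while SAGA keeps one entry per example (no full passes, but O(nd) memory). k-SVRG sits in between.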

[1] Shai Shalev-Shwartz et al. SDCA without Duality, Regularization, and Individual Convexity, 2016, ICML.

[2] Zaïd Harchaoui et al. A Universal Catalyst for First-Order Optimization, 2015, NIPS.

[3] Alexander Shapiro et al. Stochastic Approximation Approach to Stochastic Programming, 2013.

[4] Sébastien Bubeck et al. Convex Optimization: Algorithms and Complexity, 2014, Found. Trends Mach. Learn.

[5] Alexander J. Smola et al. Fast incremental method for smooth nonconvex optimization, 2016, IEEE 55th Conference on Decision and Control (CDC).

[6] H. Robbins. A Stochastic Approximation Method, 1951.

[7] Zeyuan Allen-Zhu et al. Katyusha: the first direct acceleration of stochastic gradient methods, 2016, J. Mach. Learn. Res.

[8] Brian P. Dawkins. Siobhan's Problem: The Coupon Collector Revisited, 1991.

[9] Christopher M. Bishop. Pattern Recognition and Machine Learning, 2006, Springer.

[10] Suvrit Sra et al. Random Shuffling Beats SGD after Finite Epochs, 2018, ICML.

[11] Zeyuan Allen-Zhu et al. Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives, 2015, ICML.

[12] Mark W. Schmidt et al. Minimizing finite sums with the stochastic average gradient, 2013, Mathematical Programming.

[13] Jie Liu et al. SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient, 2017, ICML.

[14] Michael I. Jordan et al. Less than a Single Pass: Stochastically Controlled Stochastic Gradient, 2016, AISTATS.

[15] Michael I. Jordan et al. Non-convex Finite-Sum Optimization Via SCSG Methods, 2017, NIPS.

[16] Alexander J. Smola et al. On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants, 2015, NIPS.

[17] Sébastien Bubeck et al. Theory of Convex Optimization for Machine Learning, 2014, arXiv.

[18] Francis Bach et al. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives, 2014, NIPS.

[19] Tong Zhang et al. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, 2013, NIPS.

[20] L. Holst. On Birthday, Collectors', Occupancy and Other Classical Urn Problems, 1986.

[21] Shai Shalev-Shwartz et al. Stochastic dual coordinate ascent methods for regularized loss, 2012, J. Mach. Learn. Res.

[22] Aurélien Lucchi et al. Variance Reduced Stochastic Gradient Descent with Neighbors, 2015, NIPS.

[23] Zaïd Harchaoui et al. Catalyst for Gradient-based Nonconvex Optimization, 2018, AISTATS.

[24] Zeyuan Allen-Zhu et al. Variance Reduction for Faster Non-Convex Optimization, 2016, ICML.

[25] Mark W. Schmidt et al. A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets, 2012, NIPS.

[26] Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2), 1983.

[27] Mark W. Schmidt et al. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method, 2012, arXiv.

[28] Alexander J. Smola et al. Stochastic Variance Reduction for Nonconvex Optimization, 2016, ICML.