Adaptive Step Sizes in Variance Reduction via Regularization

The main goal of this work is to equip convex and nonconvex problems with the Barzilai-Borwein (BB) step size. Granted, BB step sizes are adaptive, but they can fail when the objective function is not strongly convex. To overcome this challenge, the key idea here is to bridge (non)convex problems and strongly convex ones via regularization. The proposed regularization schemes are \textit{simple} yet effective. Wedding the BB step size with the variance reduction method SARAH offers a free lunch compared with vanilla SARAH on convex problems. Convergence of BB step sizes on nonconvex problems is also established, with a complexity no worse than that of other adaptive step sizes such as AdaGrad. As a byproduct, our regularized SARAH methods for convex functions ensure that the complexity of finding a point with $\mathbb{E}[\| \nabla f(\mathbf{x}) \|^2]\leq \epsilon$ is ${\cal O}\big( (n+\frac{1}{\sqrt{\epsilon}})\ln{\frac{1}{\epsilon}}\big)$, improving the $\epsilon$-dependence of existing results. Numerical tests further validate the merits of the proposed approaches.
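
To make the bridge concrete, below is a minimal sketch (not the authors' exact algorithm) of SARAH run on an $\ell_2$-regularized objective $F(\mathbf{x}) = f(\mathbf{x}) + \frac{\lambda}{2}\|\mathbf{x}\|^2$, which is strongly convex whenever $f$ is convex, with a BB step size recomputed at each outer loop. The function names (`grad_full`, `grad_i`), the regularization weight `lam`, and the $1/m$ scaling of the BB rule (borrowed from the SVRG-BB convention) are illustrative assumptions rather than the paper's prescribed choices.

```python
import numpy as np

def sarah_bb(grad_full, grad_i, x0, n, m, epochs, lam=1e-3, eta0=0.1, seed=0):
    """Sketch: SARAH with a Barzilai-Borwein step size on the regularized
    objective F(x) = f(x) + (lam/2)||x||^2 (assumed form of the regularizer).

    grad_full(x)  -> gradient of f at x (full batch)
    grad_i(x, i)  -> gradient of the i-th component f_i at x
    m             -> number of inner (stochastic) iterations per outer loop
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    eta = eta0
    x_prev, g_prev = None, None

    for _ in range(epochs):
        # Full gradient of the regularized objective at the outer iterate.
        g = grad_full(x) + lam * x

        # BB step size from two consecutive outer iterates; keep eta0 otherwise.
        if x_prev is not None:
            dx, dg = x - x_prev, g - g_prev
            denom = m * np.dot(dx, dg)
            if denom > 1e-12:
                eta = np.dot(dx, dx) / denom
        x_prev, g_prev = x.copy(), g.copy()

        # SARAH inner loop with the recursive gradient estimator v.
        v = g
        x_old = x.copy()
        x = x - eta * v
        for _ in range(m):
            i = rng.integers(n)
            v = (grad_i(x, i) + lam * x) - (grad_i(x_old, i) + lam * x_old) + v
            x_old = x.copy()
            x = x - eta * v
    return x
```

Here the regularizer is what restores the strong convexity that keeps the BB quotient well behaved; the paper's actual schedule for $\lambda$ (e.g., how it is tied to the target accuracy $\epsilon$) may differ from the fixed value used in this sketch.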
