Finite-sum smooth optimization with SARAH

The total complexity (measured as the total number of gradient computations) of a stochastic first-order optimization algorithm that finds a first-order stationary point of a finite-sum smooth nonconvex objective function $F(w)=\frac{1}{n} \sum_{i=1}^n f_i(w)$ has been proven to be at least $\Omega(\sqrt{n}/\epsilon)$ for $n \leq \mathcal{O}(\epsilon^{-2})$, where $\epsilon$ denotes the attained accuracy $\mathbb{E}[\|\nabla F(\tilde{w})\|^2] \leq \epsilon$ for the output approximation $\tilde{w}$ (Fang et al., 2018). In this paper, we provide a convergence analysis for a slightly modified version of the SARAH algorithm (Nguyen et al., 2017a;b) and show that its total complexity matches the worst-case lower bound of Fang et al. (2018) up to a constant factor when $n \leq \mathcal{O}(\epsilon^{-2})$ for nonconvex problems. For convex optimization, we propose SARAH++, which achieves sublinear convergence for general convex problems and linear convergence for strongly convex problems; we also provide a practical version for which numerical experiments on various datasets show improved performance.
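
To make the recursion concrete, below is a minimal sketch of the basic SARAH outer/inner loop from Nguyen et al. (2017a), assuming per-component gradients are available through a user-supplied `grad_i(w, i)`; the function name, step size, and loop lengths are illustrative defaults, and the sketch does not include the modification or the SARAH++ stopping rule analyzed in this paper.

```python
import numpy as np

def sarah(grad_i, n, w0, eta=0.01, n_outer=10, inner_len=None, rng=None):
    """Minimal sketch of the SARAH loop (Nguyen et al., 2017a).

    grad_i(w, i) is assumed to return the gradient of the i-th component
    f_i at w; all defaults below are illustrative, not the paper's choices.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = n if inner_len is None else inner_len  # inner-loop length (assumption)
    w = np.asarray(w0, dtype=float)

    for _ in range(n_outer):
        # Full gradient at the start of each outer iteration.
        v = np.mean([grad_i(w, i) for i in range(n)], axis=0)
        w_prev, w = w, w - eta * v
        for _ in range(m):
            i = rng.integers(n)
            # Recursive estimator: v is updated from two consecutive iterates,
            # reusing the previous estimate instead of a fixed snapshot as in SVRG.
            v = grad_i(w, i) - grad_i(w_prev, i) + v
            w_prev, w = w, w - eta * v
    return w
```

The variant analyzed in the paper and SARAH++ differ from this plain loop in how the inner-loop length and the returned iterate are chosen, but they share the same recursive gradient estimator.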

[1] Cong Fang, et al. SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path Integrated Differential Estimator, 2018, NeurIPS.

[2] Shai Shalev-Shwartz, et al. Stochastic dual coordinate ascent methods for regularized loss, 2012, J. Mach. Learn. Res.

[3] Peter Richtárik, et al. Semi-Stochastic Gradient Descent Methods, 2013, Front. Appl. Math. Stat.

[4] Chih-Jen Lin, et al. LIBSVM: A library for support vector machines, 2011, TIST.

[5] Zeyuan Allen-Zhu, et al. Natasha: Faster Non-Convex Stochastic Optimization via Strongly Non-Convex Parameter, 2017, ArXiv.

[6] Quanquan Gu, et al. Stochastic Nested Variance Reduction for Nonconvex Optimization, 2018, J. Mach. Learn. Res.

[7] H. Robbins, et al. A Stochastic Approximation Method, 1951.

[8] Michael I. Jordan, et al. Non-convex Finite-Sum Optimization Via SCSG Methods, 2017, NIPS.

[9] Tong Zhang, et al. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, 2013, NIPS.

[10] Mark W. Schmidt, et al. A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets, 2012, NIPS.

[11] Julien Mairal, et al. Optimization with First-Order Surrogate Functions, 2013, ICML.

[12] Francis Bach, et al. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives, 2014, NIPS.

[13] Zeyuan Allen-Zhu, et al. Natasha: Faster Non-Convex Stochastic Optimization via Strongly Non-Convex Parameter, 2017, ICML.

[14] Zeyuan Allen-Zhu, et al. Natasha 2: Faster Non-Convex Optimization Than SGD, 2017, NeurIPS.

[15] Yi Zhou, et al. SpiderBoost: A Class of Faster Variance-reduced Algorithms for Nonconvex Optimization, 2018, ArXiv.

[16] Lam M. Nguyen, et al. Stochastic Recursive Gradient Algorithm for Nonconvex Optimization, 2017, ArXiv.

[17] Mark W. Schmidt, et al. Minimizing finite sums with the stochastic average gradient, 2013, Mathematical Programming.

[18] Alexander J. Smola, et al. Stochastic Variance Reduction for Nonconvex Optimization, 2016, ICML.

[19] Yurii Nesterov, et al. Introductory Lectures on Convex Optimization - A Basic Course, 2014, Applied Optimization.

[20] Peter Richtárik, et al. SGD and Hogwild! Convergence Without the Bounded Gradients Assumption, 2018, ICML.

[21] Jorge Nocedal, et al. Optimization Methods for Large-Scale Machine Learning, 2016, SIAM Rev.

[22] Lam M. Nguyen, et al. SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient, 2017, ICML.

[23] Yoram Singer, et al. Pegasos: primal estimated sub-gradient solver for SVM, 2011, Math. Program.