SADAGRAD: Strongly Adaptive Stochastic Gradient Methods

Although the convergence rates of existing variants of ADAGRAD have a better dependence on the number of iterations under the strong convexity condition, their iteration complexities have an explicitly linear dependence on the dimensionality of the problem. To alleviate this dependence, we propose a simple yet novel variant of ADAGRAD for stochastic (weakly) strongly convex optimization. Different from existing variants, the proposed variant (referred to as SADAGRAD) uses an adaptive restarting scheme in which (i) ADAGRAD serves as a sub-routine and is restarted periodically, and (ii) the number of iterations before each restart depends on the history of learning, which incorporates knowledge of the geometry of the data. In addition to the adaptive proximal functions and the adaptive number of iterations between restarts, we also develop a variant that is adaptive to the (implicit) strong convexity of the data, which together make the proposed algorithm strongly adaptive. In the worst case SADAGRAD has an O(1/ε) iteration complexity for finding an ε-optimal solution, similar to other variants. However, it can enjoy faster convergence and a much better dependence on the problem's dimensionality when the stochastic gradients are sparse. Extensive experiments on large-scale data sets demonstrate the efficiency of the proposed algorithms in comparison with several variants of ADAGRAD and the stochastic gradient method.
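The abstract describes the high-level mechanism only: ADAGRAD is run as a sub-routine and restarted when a condition on the accumulated gradient history is met, with the stage parameters tightened between restarts. The Python sketch below illustrates that restart structure. The function names, the specific stopping test (sum over coordinates of the root of accumulated squared gradients, divided by the stage length, compared to a threshold theta), and the halving schedules are assumptions made for illustration; they are not the paper's exact algorithm, restart criterion, or constants.

```python
import numpy as np

def adagrad_stage(grad_fn, x0, eta, theta, max_iters=100000, eps=1e-8):
    """Run ADAGRAD from x0 until a history-based stopping test fires.

    grad_fn(x) returns a stochastic (sub)gradient at x. The stage ends once
    sum_i sqrt(sum_t g_{t,i}^2) / t falls below theta, so the stage length
    depends on the observed gradient history (an illustrative criterion,
    not the paper's exact rule). Returns the averaged iterate of the stage.
    """
    x = x0.copy()
    g_sq = np.zeros_like(x)            # per-coordinate sum of squared gradients
    x_avg = np.zeros_like(x)
    for t in range(1, max_iters + 1):
        g = grad_fn(x)
        g_sq += g * g
        x = x - eta * g / (np.sqrt(g_sq) + eps)   # adaptive coordinate-wise step
        x_avg += (x - x_avg) / t                  # running average of iterates
        if np.sum(np.sqrt(g_sq)) / t < theta:     # history-dependent restart test
            break
    return x_avg

def sadagrad_sketch(grad_fn, x0, eta0=1.0, theta0=1.0, num_stages=10):
    """Restarted ADAGRAD in the spirit of SADAGRAD (sketch only): each stage
    restarts ADAGRAD from the previous stage's averaged solution, halving the
    step size and the stopping threshold between stages."""
    x, eta, theta = np.asarray(x0, dtype=float), eta0, theta0
    for _ in range(num_stages):
        x = adagrad_stage(grad_fn, x, eta, theta)
        eta *= 0.5      # geometric decrease of the step size
        theta *= 0.5    # tighter stopping test, so later stages run longer
    return x
```

Under this illustrative stopping test, sparse stochastic gradients keep the per-coordinate accumulators small, so each stage terminates after fewer iterations, which mirrors the abstract's claim that sparsity yields faster convergence and a milder dependence on the dimensionality.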
