Adaptive gradient descent without descent

We present a strikingly simple proof that two rules are sufficient to automate gradient descent: 1) don't increase the stepsize too fast and 2) don't overstep the local curvature. No need for function values, no line search, no information about the function except for the gradients. By following these rules, you get a method adaptive to the local geometry, with convergence guarantees depending only on smoothness in a neighborhood of a solution. Provided the problem is convex, our method converges even if the global smoothness constant is infinity. As an illustration, it can minimize an arbitrary twice continuously differentiable convex function. We examine its performance on a range of convex and nonconvex problems, including matrix factorization and training ResNet-18.
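The abstract does not spell out the update rule, but the two principles translate naturally into a stepsize that is capped both by a bounded growth factor and by a local curvature estimate formed from consecutive gradients. Below is a minimal NumPy sketch along those lines; the names (`adaptive_gd`, `grad`, `lam0`), the growth factor `sqrt(1 + theta)`, and the 1/2 safety factor are illustrative assumptions, not necessarily the paper's exact constants.

```python
import numpy as np

def adaptive_gd(grad, x0, n_iters=1000, lam0=1e-6):
    """Gradient descent with a stepsize built from the two rules in the
    abstract: (1) the stepsize may grow only by a bounded factor per
    iteration, and (2) it must not overstep an estimate of the inverse
    local curvature taken from consecutive iterates and gradients.
    Illustrative sketch only; uses gradients, no function values."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad(x_prev)
    x = x_prev - lam0 * g_prev        # small warm-up step
    lam_prev, theta = lam0, np.inf    # large theta keeps rule (1) inactive at k = 1
    for _ in range(n_iters):
        g = grad(x)
        diff_x = np.linalg.norm(x - x_prev)
        diff_g = np.linalg.norm(g - g_prev)
        # Rule 2: don't overstep the local curvature ~ ||dg|| / ||dx||.
        curv_bound = 0.5 * diff_x / diff_g if diff_g > 0 else np.inf
        # Rule 1: don't increase the stepsize too fast.
        growth_bound = np.sqrt(1.0 + theta) * lam_prev
        lam = min(growth_bound, curv_bound)
        x_prev, g_prev = x, g
        x = x - lam * g
        theta, lam_prev = lam / lam_prev, lam
    return x

# Example use on a least-squares problem (hypothetical data):
# A, b = np.random.randn(50, 10), np.random.randn(50)
# x_star = adaptive_gd(lambda x: A.T @ (A @ x - b), np.zeros(10))
```

Note that the loop never evaluates the objective itself: the curvature bound is a ratio of iterate and gradient differences, which is exactly the kind of "gradients only" information the abstract refers to.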
