Rethinking the Variational Interpretation of Nesterov's Accelerated Method

The continuous-time model of Nesterov’s momentum provides a thought-provoking perspective for understanding the nature of the acceleration phenomenon in convex optimization. One of the main ideas in this line of research comes from classical mechanics and proposes to link Nesterov’s trajectory to the solution of a set of Euler-Lagrange equations relative to the so-called Bregman Lagrangian. In recent years, this approach has led to the discovery of many new (stochastic) accelerated algorithms and has provided a solid theoretical foundation for the design of structure-preserving accelerated methods. In this work, we revisit this idea and provide an in-depth analysis of the action relative to the Bregman Lagrangian from the point of view of the calculus of variations. Our main finding is that, while Nesterov’s method is a stationary point of the action, it is often not a minimizer but rather a saddle point of this functional in the space of differentiable curves. This finding challenges the main intuition behind the variational interpretation of Nesterov’s method and provides additional insight into the intriguing geometry of accelerated paths.
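
As background for the claim above, the following is a minimal LaTeX sketch of the variational setup, assuming the standard Bregman-Lagrangian framework of Wibisono, Wilson, and Jordan; the Euclidean parameter choice at the end is one conventional instantiation that recovers the ODE model of Su, Boyd, and Candès, not a derivation taken from this paper.

\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Bregman Lagrangian with distance-generating function h, Bregman
% divergence D_h, objective f, and time-dependent scalings
% \alpha_t, \beta_t, \gamma_t, together with its action functional:
\begin{gather*}
  \mathcal{L}(X,\dot X,t)
    = e^{\alpha_t+\gamma_t}\Bigl(D_h\bigl(X+e^{-\alpha_t}\dot X,\;X\bigr)
      - e^{\beta_t} f(X)\Bigr),
  \qquad
  \mathcal{J}[X] = \int_{t_0}^{t_1}\mathcal{L}(X_t,\dot X_t,t)\,dt .
\end{gather*}
% Under the ideal-scaling conditions $\dot\beta_t \le e^{\alpha_t}$ and
% $\dot\gamma_t = e^{\alpha_t}$, stationary (Euler--Lagrange) curves
% satisfy $f(X_t)-f(x^\star)=O(e^{-\beta_t})$.
% Illustrative Euclidean case (an assumption for this sketch):
% $h(x)=\tfrac12\|x\|^2$, $\alpha_t=\log(2/t)$, $\beta_t=2\log(t/2)$,
% $\gamma_t=2\log t$ reduce the Euler--Lagrange equation to the
% continuous-time model of Nesterov's method:
\begin{equation*}
  \ddot X_t + \frac{3}{t}\dot X_t + \nabla f(X_t) = 0 .
\end{equation*}
\end{document}

The paper's observation concerns the second variation of $\mathcal{J}$ around such stationary curves: being a solution of the Euler-Lagrange equations guarantees stationarity, not minimality.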
