Understanding the acceleration phenomenon via high-resolution differential equations

Gradient-based optimization algorithms can be studied from the perspective of limiting ordinary differential equations (ODEs). Motivated by the fact that existing ODEs do not distinguish between two fundamentally different algorithms, Nesterov’s accelerated gradient method for strongly convex functions (NAG-SC) and Polyak’s heavy-ball method, we study an alternative limiting process that yields high-resolution ODEs. We show that these ODEs permit a general Lyapunov function framework for the analysis of convergence in both continuous and discrete time. We also show that these ODEs are more accurate surrogates for the underlying algorithms; in particular, they not only distinguish between NAG-SC and Polyak’s heavy-ball method, but they allow the identification of a term that we refer to as the “gradient correction,” which is present in NAG-SC but not in the heavy-ball method and is responsible for the qualitative difference in convergence of the two methods. We also use the high-resolution ODE framework to study Nesterov’s accelerated gradient method for (non-strongly) convex functions (NAG-C), uncovering a hitherto unknown result: NAG-C minimizes the squared gradient norm at an inverse cubic rate. Finally, by modifying the high-resolution ODE of NAG-C, we obtain a family of new optimization methods that are shown to maintain the accelerated convergence rates of NAG-C for smooth convex functions.
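
To make the distinction concrete, the high-resolution ODEs can be sketched as follows (notation assumed here: s > 0 is the step size, \mu the strong-convexity parameter, and X = X(t) the continuous-time trajectory; the constants are stated as in the published derivation and should be checked against the paper):

\ddot{X} + 2\sqrt{\mu}\,\dot{X} + (1 + \sqrt{\mu s})\,\nabla f(X) = 0   (heavy-ball)
\ddot{X} + 2\sqrt{\mu}\,\dot{X} + \sqrt{s}\,\nabla^{2} f(X)\,\dot{X} + (1 + \sqrt{\mu s})\,\nabla f(X) = 0   (NAG-SC)
\ddot{X} + \tfrac{3}{t}\,\dot{X} + \sqrt{s}\,\nabla^{2} f(X)\,\dot{X} + \big(1 + \tfrac{3\sqrt{s}}{2t}\big)\,\nabla f(X) = 0   (NAG-C)

The Hessian-driven term \sqrt{s}\,\nabla^{2} f(X)\,\dot{X} is the gradient correction and is the only difference between the first two equations. In discrete time it corresponds to a difference of successive gradients, which the following minimal Python sketch makes explicit (the function names, the quadratic test problem, and the shared momentum coefficient are illustrative assumptions, not taken from the paper):

import numpy as np

def heavy_ball(grad, x0, s, mu, iters=200):
    # Polyak heavy-ball: momentum acts on the iterates only.
    beta = (1 - np.sqrt(mu * s)) / (1 + np.sqrt(mu * s))
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        x_next = x + beta * (x - x_prev) - s * grad(x)
        x_prev, x = x, x_next
    return x

def nag_sc(grad, x0, s, mu, iters=200):
    # Same update plus the gradient correction
    # beta * s * (grad(x_k) - grad(x_{k-1})), the discrete
    # counterpart of sqrt(s) * Hessian(X) * Xdot.
    beta = (1 - np.sqrt(mu * s)) / (1 + np.sqrt(mu * s))
    x_prev, x = x0.copy(), x0.copy()
    g_prev = grad(x0)
    for _ in range(iters):
        g = grad(x)
        x_next = x + beta * (x - x_prev) - s * g - beta * s * (g - g_prev)
        x_prev, x, g_prev = x, x_next, g
    return x

# Example on an ill-conditioned quadratic f(x) = 0.5 * x' A x.
A = np.diag([1.0, 100.0])
grad = lambda x: A @ x
x0 = np.array([1.0, 1.0])
print(heavy_ball(grad, x0, s=1e-2, mu=1.0), nag_sc(grad, x0, s=1e-2, mu=1.0))

The only difference between the two updates is the term beta * s * (g - g_prev); this is the gradient correction whose presence in NAG-SC, but not in heavy-ball, underlies the qualitative gap in convergence described above.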
