The Role of Memory in Stochastic Optimization

The choice of how to retain information about past gradients dramatically affects the convergence properties of state-of-the-art stochastic optimization methods, such as Heavy-ball, Nesterov's momentum, RMSprop, and Adam. Building on this observation, we use stochastic differential equations (SDEs) to explicitly study the role of memory in gradient-based algorithms. We first derive a general continuous-time model that can incorporate arbitrary types of memory, in both deterministic and stochastic settings. We provide convergence guarantees for this SDE for weakly-quasi-convex and quadratically growing functions. We then demonstrate how to discretize this SDE to obtain a flexible discrete-time algorithm that can implement a broad spectrum of memories, ranging from short- to long-term. Not only does this algorithm increase the degrees of freedom in algorithmic choice for practitioners, but it also comes with better stability properties than classical momentum in the convex stochastic setting; in particular, no iterate averaging is needed for convergence. Interestingly, our analysis also provides a novel interpretation of Nesterov's momentum as stable gradient amplification and highlights a possible reason for its unstable behavior in the (convex) stochastic setting. Furthermore, we discuss the use of long-term memory for second-moment estimation in adaptive methods such as Adam and RMSprop. Finally, we provide an extensive experimental study of the effect of different types of memory in both convex and nonconvex settings.
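To make the notion of gradient memory concrete, the sketch below contrasts two generic weighting schemes for past gradients: an exponential moving average (short-term memory, the mechanism behind Heavy-ball and the second-moment estimate in RMSprop/Adam) and a uniform running average over all past gradients (long-term memory). This is a minimal illustration on a toy quadratic, not the algorithm proposed in the paper; the objective, step size, decay factor, and noise level are arbitrary assumptions.

```python
import numpy as np

def grad(x):
    # Toy quadratic objective f(x) = 0.5 * ||x||^2, so grad f(x) = x.
    return x

def run(memory="short", beta=0.9, lr=0.1, steps=100, seed=0):
    """Gradient descent driven by a memory of past (noisy) gradients.

    memory="short": exponential moving average -> recent gradients dominate.
    memory="long" : uniform average of all past gradients -> long-term memory.
    """
    rng = np.random.default_rng(seed)
    x = np.ones(2)
    m = np.zeros_like(x)  # memory of past gradients
    for k in range(1, steps + 1):
        g = grad(x) + 0.1 * rng.standard_normal(x.shape)  # stochastic gradient
        if memory == "short":
            m = beta * m + (1 - beta) * g                 # EMA: short-term memory
        else:
            m = m + (g - m) / k                           # running mean: long-term memory
        x = x - lr * m                                    # step along the remembered direction
    return x

print("short-term memory iterate:", run("short"))
print("long-term  memory iterate:", run("long"))
```

Varying the weighting scheme (here, `beta` versus the `1/k` running-mean weights) is one simple way to interpolate between short- and long-term memory of past gradients.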
