Understanding the Role of Momentum in Non-Convex Optimization: Practical Insights from a Lyapunov Analysis

Momentum methods are now used pervasively within the machine learning community for training non-convex models such as deep neural networks. Empirically, they outperform traditional stochastic gradient descent (SGD) approaches. In this work we develop a Lyapunov analysis of SGD with momentum (SGD+M) by utilizing an equivalent rewriting of the method known as the stochastic primal averaging (SPA) form. This analysis is much tighter than prior theory in the non-convex case; as a result, we are able to give precise insights into when SGD+M outperforms SGD, and which hyperparameter schedules work and why.

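To make the SPA rewriting concrete, the sketch below (mine, not the paper's: the toy quadratic, variable names, and the constant-hyperparameter setting are all illustrative assumptions) runs SGD+M and the SPA form side by side on the same stochastic gradients and checks that the iterates coincide under the standard correspondence c = 1 - beta, eta = alpha / (1 - beta).

```python
import numpy as np

# Minimal sketch (assumption, not the paper's code): with constant hyperparameters,
# SGD with momentum in the heavy-ball convention
#   m_{t+1} = beta * m_t + g_t,   x_{t+1} = x_t - alpha * m_{t+1}
# coincides with the stochastic primal averaging (SPA) form
#   z_{t+1} = z_t - eta * g_t,    x_{t+1} = (1 - c) * x_t + c * z_{t+1}
# under c = 1 - beta, eta = alpha / (1 - beta), with z_0 = x_0 and m_0 = 0.

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
H = A.T @ A / d                          # toy quadratic f(x) = 0.5 * x^T H x

def stochastic_grad(x):
    # exact gradient of the toy quadratic plus Gaussian noise
    return H @ x + 0.1 * rng.standard_normal(d)

alpha, beta = 0.05, 0.9                  # SGD+M hyperparameters
c, eta = 1.0 - beta, alpha / (1.0 - beta)  # equivalent SPA hyperparameters

x_m = np.ones(d); m = np.zeros(d)        # SGD+M state
x_s = np.ones(d); z = np.ones(d)         # SPA state (z initialized at x)

for t in range(100):
    g = stochastic_grad(x_m)             # reuse the same stochastic gradient for both forms

    # SGD+M update
    m = beta * m + g
    x_m = x_m - alpha * m

    # SPA update
    z = z - eta * g
    x_s = (1.0 - c) * x_s + c * z

print(np.max(np.abs(x_m - x_s)))         # ~0 up to floating-point error
```

In the SPA view the gradient step acts on the auxiliary sequence z while x is an online average of the z iterates, which is the kind of two-sequence structure a Lyapunov analysis can exploit.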