Toward Deeper Understanding of Nonconvex Stochastic Optimization with Momentum using Diffusion Approximations

The Momentum Stochastic Gradient Descent (MSGD) algorithm has been widely applied to many nonconvex optimization problems in machine learning, with popular examples including training deep neural networks and dimensionality reduction. Due to the lack of convexity and the extra momentum term, the optimization theory of MSGD is still largely unknown. In this paper, we study this fundamental optimization algorithm through the lens of the so-called "strict saddle problem." Using a diffusion approximation type of analysis, we show that momentum helps escape from saddle points, but hurts convergence within the neighborhood of optima (in the absence of step size annealing). Our theoretical discovery partially corroborates the empirical success of MSGD in training deep neural networks. Moreover, our analysis applies the martingale method and the "Fixed-State-Chain" method from the stochastic approximation literature, which are of independent interest.
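For concreteness, the MSGD (heavy-ball) update discussed in the abstract can be sketched as follows. This is a minimal toy implementation on a one-dimensional quadratic with artificial gradient noise; the function name, step size, momentum parameter, and noise level are all illustrative assumptions, not quantities from the paper.

```python
import random

def msgd(grad, x0, lr=0.01, momentum=0.9, noise=0.1, steps=500, seed=0):
    """Toy heavy-ball MSGD sketch (hypothetical parameters):
        v <- momentum * v - lr * (grad(x) + noise)
        x <- x + v
    """
    rng = random.Random(seed)
    x, v = x0, 0.0
    for _ in range(steps):
        # Stochastic gradient: true gradient plus Gaussian noise.
        g = grad(x) + noise * rng.gauss(0.0, 1.0)
        v = momentum * v - lr * g  # momentum accumulates past gradients
        x = x + v
    return x

# Minimize f(x) = x^2 (gradient 2x); iterates settle near the optimum 0,
# fluctuating at a scale set by the constant step size, as the abstract notes.
x_final = msgd(lambda x: 2.0 * x, x0=5.0)
```

With a constant step size the iterates hover in a noise ball around the minimizer rather than converging exactly, which is the regime in which the abstract's remark about momentum hurting convergence near optima applies.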
