Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models

The appendices are organized as follows (a minimal code sketch of the stochastic heavy ball iteration on least squares follows this list):

1. Appendix A derives the Volterra equation and proves the main result for homogenized SGD (Theorem 1).
2. Appendix B gives a heuristic derivation of the homogenized SGD approximation to the SDA class of algorithms on the least squares problem, and shows that SGD and homogenized SGD are close under orthogonal invariance (Theorem 2).
3. Appendix C gives a general overview of the analysis of convolution Volterra equations of the type that arise in the SDA class.
4. Appendix D details the analysis of homogenized SGD for SDANA, including average-case analysis and near-optimal parameters.
5. Appendix E shows the equivalence of SDAHB with SHB, as well as general average-case complexity and parameter selections.
6. Appendix F contains details on the simulations.
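To make the objects in this outline concrete, here is a minimal sketch of the mini-batch stochastic heavy ball (SHB) iteration on a Gaussian least squares problem of the kind analyzed in the appendices. The step size gamma, momentum beta, batch size, and problem dimensions below are illustrative placeholders, not the paper's (near-)optimal parameter choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Least squares problem: minimize f(x) = (1/2n) * ||A x - b||^2
n, d = 1000, 500
A = rng.standard_normal((n, d)) / np.sqrt(d)
x_star = rng.standard_normal(d)
b = A @ x_star

def shb(gamma=0.5, beta=0.9, batch=16, iters=2000):
    """Mini-batch stochastic heavy ball on the least squares objective.

    Update: x_{k+1} = x_k - gamma * g_k + beta * (x_k - x_{k-1}),
    where g_k is the mini-batch gradient at x_k.
    """
    x_prev = x = np.zeros(d)
    losses = []
    for _ in range(iters):
        idx = rng.integers(0, n, size=batch)
        g = A[idx].T @ (A[idx] @ x - b) / batch   # mini-batch gradient
        x, x_prev = x - gamma * g + beta * (x - x_prev), x
        losses.append(0.5 * np.mean((A @ x - b) ** 2))
    return np.array(losses)

loss_curve = shb()
print(f"final loss: {loss_curve[-1]:.3e}")
```

Roughly speaking, loss curves generated this way are the empirical quantities whose large-dimensional limits the Volterra-equation analysis of Appendices A and C is meant to describe.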
