Uniform-in-Time Weak Error Analysis for Stochastic Gradient Descent Algorithms via Diffusion Approximation

Diffusion approximation provides a weak approximation of stochastic gradient descent (SGD) algorithms over a finite time horizon. In this paper, we introduce into the theoretical framework of diffusion approximation new tools motivated by the backward error analysis of numerical stochastic differential equations, extending the validity of the weak approximation from a finite to an infinite time horizon. The techniques developed here allow us to characterize the asymptotic behavior of constant-step-size SGD for strongly convex objective functions, a goal previously out of reach within the diffusion approximation framework. Our analysis builds on a truncated formal power expansion of the solution of a stochastic modified equation arising from diffusion approximation; the main technical ingredient is a uniform-in-time weak error bound controlling the long-term behavior of the expansion coefficient functions near the global minimum. We expect these techniques to substantially broaden the applicability of diffusion approximation to wider and deeper aspects of stochastic optimization algorithms in data science.
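
To fix ideas, the LaTeX snippet below sketches the standard objects behind this program: the constant-step-size SGD iteration, the stochastic modified equation used as its diffusion approximation, and the shape of a uniform-in-time weak error statement. The notation (step size $\eta$, per-sample loss $f_\gamma$, covariance $\Sigma$, test function $\varphi$, constant $C$) follows the stochastic modified equation literature and is assumed for illustration; it is not taken verbatim from the paper.

% Schematic setup (assumed notation, not the paper's exact statements).
% SGD iteration with constant step size \eta and random sample index \gamma_k:
\[
  x_{k+1} = x_k - \eta \,\nabla f_{\gamma_k}(x_k), \qquad k = 0, 1, 2, \dots
\]
% First-order stochastic modified equation (diffusion approximation), with
% f(x) = \mathbb{E}_\gamma f_\gamma(x) and \Sigma(x) = \mathrm{Cov}\bigl(\nabla f_\gamma(x)\bigr):
\[
  \mathrm{d}X_t = -\nabla f(X_t)\,\mathrm{d}t
                + \sqrt{\eta}\,\Sigma(X_t)^{1/2}\,\mathrm{d}W_t,
  \qquad X_0 = x_0 .
\]
% Uniform-in-time weak error bound: for suitable test functions \varphi and a
% strongly convex objective, the constant C(\varphi) does not grow with k:
\[
  \sup_{k \ge 0}\,\bigl| \mathbb{E}\,\varphi(x_k) - \mathbb{E}\,\varphi(X_{k\eta}) \bigr|
  \;\le\; C(\varphi)\,\eta .
\]

A finite-horizon analysis would only give a constant growing with $k\eta$; the uniform-in-time supremum over $k$ is what allows the diffusion approximation to capture the asymptotic behavior of the iterates.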
