Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowly and fast decaying learning rates

The stochastic gradient descent (SGD) optimization algorithm plays a central role in a wide range of machine learning applications. The scientific literature provides a vast number of upper error bounds for the SGD method, but much less attention has been paid to proving lower error bounds for it. The key contribution of this paper is to make a step in this direction. More precisely, in this article we establish, for every $\gamma, \nu \in (0,\infty)$, essentially matching lower and upper bounds for the mean square error of the SGD process with learning rates $(\frac{\gamma}{n^\nu})_{n \in \mathbb{N}}$ associated with a simple quadratic stochastic optimization problem. This allows us to precisely quantify the mean square convergence rate of the SGD method as a function of the asymptotic behavior of the learning rates.
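
As a minimal numerical sketch of the setting described above (not the authors' construction), the snippet below runs SGD with learning rates $\gamma / n^\nu$ on an illustrative quadratic stochastic objective $f(\theta) = \tfrac{1}{2}\,\mathbb{E}[(\theta - X)^2]$ with Gaussian samples $X$, and estimates the mean square error $\mathbb{E}[|\Theta_N - \theta^*|^2]$ by Monte Carlo; the objective, distribution, and parameter values are assumptions chosen only to make the rate dependence on $\nu$ visible.

```python
import numpy as np

def sgd_mse(gamma, nu, n_steps=10_000, n_runs=1_000,
            theta_star=1.0, noise_std=1.0, seed=0):
    """Monte Carlo estimate of E[|Theta_N - theta*|^2] for SGD with
    learning rates gamma / n**nu on the illustrative quadratic problem
    f(theta) = E[(theta - X)^2] / 2 with X ~ N(theta_star, noise_std^2)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_runs)                 # all runs start at Theta_0 = 0
    for n in range(1, n_steps + 1):
        x = rng.normal(theta_star, noise_std, size=n_runs)
        grad = theta - x                     # unbiased sample of f'(theta)
        theta -= (gamma / n**nu) * grad      # SGD step with rate gamma / n^nu
    return np.mean((theta - theta_star) ** 2)

if __name__ == "__main__":
    # Slowly vs. fast decaying learning rates yield visibly different errors.
    for nu in (0.25, 0.5, 0.75, 1.0):
        print(f"nu = {nu}: estimated MSE = {sgd_mse(gamma=1.0, nu=nu):.3e}")
```

Comparing the printed errors across values of $\nu$ gives a rough empirical counterpart to the dependence of the mean square convergence rate on the learning-rate decay that the paper quantifies rigorously.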
