The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure

There is a stark disparity between the step-size schedules used in practical large-scale machine learning and those considered optimal by the theory of stochastic approximation. In theory, most results rely on polynomially decaying learning rate schedules, while, in practice, the "Step Decay" schedule is among the most popular: the learning rate is cut by a constant factor every fixed number of epochs, i.e., it decays geometrically. This work examines the step-decay schedule for the stochastic optimization problem of streaming least squares regression (in both the non-strongly convex and strongly convex cases), where we show that a sharp theoretical characterization of an optimal learning rate schedule is far more nuanced than suggested by previous work. We focus specifically on the rate achievable when using the final iterate of stochastic gradient descent, as is commonly done in practice. Our main result proves that a properly tuned geometrically decaying learning rate schedule provides an exponential improvement (in terms of the condition number) over any polynomially decaying learning rate schedule. We also provide experimental support for the wider applicability of these results, including for training modern deep neural networks.
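
To make the schedule concrete, the sketch below runs SGD on streaming least squares with a step-decay learning rate and returns the final iterate. It is a minimal illustration under stated assumptions, not the paper's exact construction: the halving factor, the choice of roughly log T phases, and names such as step_decay_lr and sgd_streaming_least_squares are illustrative.

import numpy as np

def step_decay_lr(eta0, t, T, num_phases):
    """Step-decay schedule: split the horizon T into num_phases equal-length
    phases and halve the learning rate once per phase, so it decays
    geometrically from eta0 over the run."""
    phase_len = max(1, T // num_phases)
    return eta0 * (0.5 ** (t // phase_len))

def sgd_streaming_least_squares(stream, d, eta0=0.01, T=10_000, num_phases=None):
    """SGD on streaming least squares (x_t, y_t) pairs with a step-decay
    learning rate; returns the final iterate (no averaging)."""
    if num_phases is None:
        num_phases = int(np.log2(T))  # roughly log T cuts (an illustrative choice)
    w = np.zeros(d)
    for t in range(T):
        x, y = next(stream)                 # one fresh sample per step
        eta = step_decay_lr(eta0, t, T, num_phases)
        grad = (x @ w - y) * x              # gradient of 0.5 * (x^T w - y)^2
        w -= eta * grad
    return w

# Example usage on a synthetic linear model y = x^T w* + noise.
def make_stream(d, w_star, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    while True:
        x = rng.normal(size=d)
        yield x, x @ w_star + noise * rng.normal()

d = 20
w_star = np.ones(d)
w_hat = sgd_streaming_least_squares(make_stream(d, w_star), d)
print(np.linalg.norm(w_hat - w_star))

For comparison, a polynomially decaying baseline of the kind discussed above can be obtained by replacing step_decay_lr with, e.g., eta0 / (1 + t)**alpha for some alpha in (0, 1].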
