No more pesky learning rates

The performance of stochastic gradient descent (SGD) depends critically on how learning rates are tuned and decreased over time. We propose a method to automatically adjust multiple learning rates so as to minimize the expected error at any one time. The method relies on local gradient variations across samples. In our approach, learning rates can increase as well as decrease, making it suitable for non-stationary problems. Using a number of convex and non-convex learning tasks, we show that the resulting algorithm matches the performance of SGD or other adaptive approaches with their best settings obtained through systematic search, and effectively removes the need for learning rate tuning.

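To make the idea concrete, below is a minimal sketch of per-parameter adaptive learning rates driven by local gradient variation, written in NumPy. It is an illustration under stated assumptions, not the paper's exact algorithm: the class name AdaptiveLRSGD, the use of exponential moving averages of the gradient and squared gradient as the adaptation signal, the memory-size (tau) update, and the caller-supplied diagonal curvature estimate are all choices made for this sketch.

import numpy as np

class AdaptiveLRSGD:
    """Sketch: per-parameter learning rates from gradient statistics.

    The rate for each parameter is proportional to
    (running mean of gradient)^2 / (curvature * running mean of squared gradient),
    so it grows when successive gradient samples agree (low variance)
    and shrinks when they are noisy. This is an illustrative
    simplification, not the verbatim method of the paper.
    """

    def __init__(self, dim, tau_init=10.0):
        self.g_bar = np.zeros(dim)          # running average of the gradient
        self.v_bar = np.ones(dim)           # running average of the squared gradient
        self.tau = np.full(dim, tau_init)   # per-parameter memory size

    def step(self, params, grad, curvature):
        # Update exponential moving averages with per-parameter time constant tau.
        f = 1.0 / self.tau
        self.g_bar = (1.0 - f) * self.g_bar + f * grad
        self.v_bar = (1.0 - f) * self.v_bar + f * grad ** 2

        # Adaptive learning rate: large when gradients are consistent
        # across samples, small when they mostly reflect noise.
        lr = self.g_bar ** 2 / (curvature * self.v_bar + 1e-12)

        # Shrink the memory after surprising gradients so the averages
        # can track non-stationary objectives (rates can rise again later).
        self.tau = (1.0 - self.g_bar ** 2 / (self.v_bar + 1e-12)) * self.tau + 1.0

        return params - lr * grad

A toy usage on a noisy quadratic, where the true per-parameter curvature h is passed in directly:

rng = np.random.default_rng(0)
h = np.array([1.0, 10.0])      # diagonal curvature of the quadratic
x = np.array([5.0, 5.0])       # initial parameters
opt = AdaptiveLRSGD(dim=2)
for _ in range(1000):
    grad = h * x + rng.normal(size=2)   # noisy gradient sample
    x = opt.step(x, grad, curvature=h)

Note how no learning rate is tuned by hand: near the optimum the gradient mean shrinks relative to its variance, so the effective step size decays automatically, and it can grow again if the objective shifts.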