Equilibrated adaptive learning rates for non-convex optimization

Parameter-specific adaptive learning rate methods are computationally efficient ways to reduce the ill-conditioning problems encountered when training large deep networks. Following recent work that strongly suggests that most of the critical points encountered when training such networks are saddle points, we examine how accounting for the presence of negative eigenvalues of the Hessian can help design better-suited adaptive learning rate schemes. We show that the popular Jacobi preconditioner has undesirable behavior in the presence of both positive and negative curvature, and present theoretical and empirical evidence that the so-called equilibration preconditioner is comparatively better suited to non-convex problems. We introduce a novel adaptive learning rate scheme, called ESGD, based on the equilibration preconditioner. Our experiments show that ESGD performs as well as or better than RMSProp in terms of convergence speed, and always clearly improves over plain stochastic gradient descent.
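
To make the update rule concrete, below is a minimal NumPy sketch of the idea behind ESGD on a toy quadratic: the diagonal of H^2 is estimated from Hessian-vector products with Gaussian probe vectors (E[(Hv)^2] = diag(H^2) for v ~ N(0, I)), its square root is used as a per-parameter scaling, and each gradient step is divided by this scaling plus a damping term. The toy loss, the explicit Hessian, the warm-up phase, and all hyper-parameter values are illustrative assumptions rather than the paper's experimental setup; in a deep network the Hessian-vector products would be computed matrix-free (e.g., with the R-operator for fast curvature matrix-vector products).

```python
import numpy as np

# Minimal sketch of equilibrated SGD (ESGD) on a toy quadratic loss
# f(x) = 0.5 x^T H x - b^T x whose ill-conditioning is mostly axis-
# aligned, so a diagonal preconditioner can help.  The explicit Hessian,
# the warm-up phase and the hyper-parameters (`lr`, `damping`,
# `estimate_every`) are illustrative assumptions, not the paper's setup.

rng = np.random.default_rng(0)
n = 50

# H = S (I + 0.1 B) S: a well-conditioned core with badly scaled axes.
S = np.diag(np.logspace(0, 2, n))            # per-axis scales from 1 to 100
G = rng.standard_normal((n, n))
B = (G + G.T) / (2.0 * np.sqrt(2.0 * n))     # symmetric, spectral norm ~ 1
H = S @ (np.eye(n) + 0.1 * B) @ S
b = rng.standard_normal(n)

def grad(x):
    return H @ x - b

# Running estimate of diag(H^2) from Hessian-vector products:
# for v ~ N(0, I),  E[(Hv)_i^2] = sum_j H_ij^2 = (H^2)_ii.
D, count = np.zeros(n), 0
for _ in range(50):                          # warm-up: average 50 samples
    v = rng.standard_normal(n)               # Gaussian probe vector
    D += (H @ v) ** 2
    count += 1

x = rng.standard_normal(n)
lr, damping, estimate_every = 0.5, 1e-4, 20
print("initial gradient norm:", np.linalg.norm(grad(x)))

for t in range(1, 501):
    if t % estimate_every == 0:              # periodically refresh estimate
        v = rng.standard_normal(n)
        D += (H @ v) ** 2
        count += 1
    # Equilibration preconditioner: sqrt(E[(Hv)^2]) estimates the row
    # norms ||H_i,:||_2, which stay positive even where diag(H) (the
    # Jacobi preconditioner) is small or negative.
    precond = np.sqrt(D / count) + damping
    x -= lr * grad(x) / precond

print("final gradient norm:  ", np.linalg.norm(grad(x)))
```

Swapping the equilibration estimate for an estimate of |diag(H)| (obtainable from the same probes via E[v * Hv] = diag(H)) would give the Jacobi-preconditioned variant that the paper argues is less well suited when positive and negative curvature mix.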
