ADASECANT: Robust Adaptive Secant Method for Stochastic Gradient

Stochastic gradient algorithms are the standard approach to large-scale learning problems and have led to important successes in machine learning. The convergence of SGD, however, depends on a careful choice of learning rate and on the amount of noise in the stochastic gradient estimates. In this paper, we propose a new adaptive learning rate algorithm that uses curvature information to tune the learning rates automatically. The element-wise curvature of the loss function is estimated from local statistics of the stochastic first-order gradients. We further propose a new variance reduction technique to speed up convergence. In preliminary experiments with deep neural networks, we obtained better performance than popular stochastic gradient algorithms.
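As a rough illustration of the idea described above (not the authors' exact ADASECANT algorithm; in particular it omits the variance reduction technique mentioned in the abstract), the sketch below uses a secant-style estimate of inverse curvature as a per-parameter learning rate: the ratio of recent parameter changes to recent gradient changes. The function name `adasecant_sketch`, the moving-average decay, and the `eps` damping constant are illustrative assumptions, not values from the paper.

```python
import numpy as np

def adasecant_sketch(grad_fn, theta, n_steps=1000, base_lr=1e-3,
                     decay=0.95, eps=1e-8):
    """Illustrative secant-style adaptive step sizes (a sketch, not the
    authors' method): |delta_theta| / |delta_grad| is a finite-difference
    (secant) estimate of element-wise inverse curvature, used as the
    per-parameter learning rate."""
    # Bootstrap with one plain SGD step to obtain a first secant pair.
    g_prev = grad_fn(theta)
    delta_prev = -base_lr * g_prev
    theta = theta + delta_prev

    g = grad_fn(theta)
    avg_dtheta = np.abs(delta_prev) + eps
    avg_dgrad = np.abs(g - g_prev) + eps

    for _ in range(n_steps):
        # Element-wise secant estimate of inverse curvature as the step size.
        lr = avg_dtheta / avg_dgrad
        delta = -lr * g
        g_prev, theta = g, theta + delta

        g = grad_fn(theta)
        # Moving averages smooth the noisy per-step differences.
        avg_dtheta = decay * avg_dtheta + (1.0 - decay) * np.abs(delta)
        avg_dgrad = decay * avg_dgrad + (1.0 - decay) * np.abs(g - g_prev)

    return theta

# Usage on a noisy quadratic: gradient of L(x) = ||x||^2 plus Gaussian noise.
rng = np.random.default_rng(0)
grad = lambda x: 2.0 * x + 0.01 * rng.normal(size=x.shape)
x = adasecant_sketch(grad, np.ones(5), n_steps=200)
print(x)  # should end up close to the minimizer at zero
```

For the noiseless quadratic the secant ratio recovers the exact inverse curvature, so the update reduces to a Newton step; the moving averages are what keep the estimate usable when the gradients are stochastic.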
