A robust adaptive stochastic gradient method for deep learning

Stochastic gradient algorithms have been the main focus of large-scale optimization and have led to important successes in recent advances in deep learning. The convergence of SGD depends on a careful choice of learning rate and on the amount of noise in the stochastic estimates of the gradients. In this paper, we propose an adaptive learning rate algorithm that uses stochastic curvature information about the loss function to tune the learning rates automatically. The element-wise curvature of the loss function is estimated from local statistics of the stochastic first-order gradients. We further propose a new variance reduction technique to speed up convergence. In our experiments with deep neural networks, we obtained better performance than popular stochastic gradient algorithms.
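To make the idea of estimating element-wise curvature from first-order gradients concrete, here is a minimal sketch in plain Python. It is not the algorithm proposed in the paper: it is a toy diagonal secant rule, where each coordinate's curvature is approximated by the ratio of the change in its gradient to the change in its parameter, and the learning rate is the inverse of that ratio. The function names, the noisy quadratic test problem, the moving-average decay, and the learning-rate cap are all assumptions made for illustration, and the variance reduction technique is not shown.

```python
import math
import random


def noisy_quadratic_grad(theta, curvatures, noise_std, rng):
    """Stochastic gradient of f(theta) = 0.5 * sum(c_i * theta_i**2) (toy problem)."""
    return [c * t + rng.gauss(0.0, noise_std) for c, t in zip(curvatures, theta)]


def secant_adaptive_sgd(theta0, curvatures, steps=200, noise_std=0.01,
                        decay=0.9, init_lr=0.01, max_lr=1.0, eps=1e-12, seed=0):
    """SGD with per-coordinate learning rates from a secant curvature estimate.

    Hypothetical illustration: the curvature of coordinate i is approximated by
    sqrt(E[dg_i^2]) / sqrt(E[dtheta_i^2]), and the learning rate is its inverse.
    """
    rng = random.Random(seed)
    n = len(theta0)
    theta = list(theta0)
    lr = [init_lr] * n
    avg_dg2 = [0.0] * n  # running average of squared gradient changes
    avg_dt2 = [0.0] * n  # running average of squared parameter changes
    grad = noisy_quadratic_grad(theta, curvatures, noise_std, rng)

    for _ in range(steps):
        # Take a step with the current per-coordinate learning rates.
        delta = [-l * g for l, g in zip(lr, grad)]
        theta = [t + d for t, d in zip(theta, delta)]
        new_grad = noisy_quadratic_grad(theta, curvatures, noise_std, rng)

        for i in range(n):
            dg = new_grad[i] - grad[i]
            avg_dg2[i] = decay * avg_dg2[i] + (1.0 - decay) * dg * dg
            avg_dt2[i] = decay * avg_dt2[i] + (1.0 - decay) * delta[i] * delta[i]
            # Inverse secant curvature, capped to keep early noisy estimates in check.
            lr[i] = min(max_lr, math.sqrt(avg_dt2[i]) / (math.sqrt(avg_dg2[i]) + eps))
        grad = new_grad

    return theta, lr


if __name__ == "__main__":
    final_theta, final_lr = secant_adaptive_sgd([1.0, -2.0], [1.0, 10.0])
    print("theta:", final_theta)
    print("learning rates:", final_lr)
```

On this toy problem the two coordinates have curvatures 1 and 10, so a single global learning rate would be a compromise between them. While the true gradient dominates the noise, the secant ratio drives each coordinate's rate toward the inverse of its own curvature. Once the gradient is mostly noise, gradient differences no longer shrink with the step size, so the estimated rate falls and the iterate settles near the minimum.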
