Adding Gradient Noise Improves Learning for Very Deep Networks

Deep feedforward and recurrent networks have achieved impressive results in many perception and language processing applications. Recently, more complex architectures such as Neural Turing Machines and Memory Networks have been proposed for tasks including question answering and general computation, creating a new set of optimization challenges. In this paper, we explore the low-overhead and easy-to-implement optimization technique of adding annealed Gaussian noise to the gradient, which we find surprisingly effective when training these very deep architectures. Unlike classical weight noise, gradient noise injection is complementary to advanced stochastic optimization algorithms such as Adam and AdaGrad. The technique not only helps to avoid overfitting, but can also result in lower training loss. We see consistent improvements in performance across an array of complex models, including state-of-the-art deep networks for question answering and algorithm learning. We observe that this optimization strategy allows a fully-connected 20-layer deep network to escape a bad initialization with standard stochastic gradient descent. We encourage further application of this technique to additional modern neural architectures.
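Because the technique amounts to perturbing each parameter gradient with zero-mean Gaussian noise whose variance is annealed over training, a minimal sketch is easy to give. The snippet below assumes PyTorch; the decay schedule sigma_t^2 = eta / (1 + t)^gamma follows the form described in the paper, while the particular values eta = 0.3 and gamma = 0.55 are shown here only as illustrative hyperparameters.

```python
import torch

def add_gradient_noise(model, step, eta=0.3, gamma=0.55):
    """Add annealed Gaussian noise to every parameter gradient, in place.

    sigma_t^2 = eta / (1 + step)^gamma; eta and gamma are illustrative
    values, not prescriptions from this abstract.
    """
    sigma = (eta / (1.0 + step) ** gamma) ** 0.5
    for p in model.parameters():
        if p.grad is not None:
            p.grad.add_(torch.randn_like(p.grad) * sigma)

# Typical use inside a training loop, after backward() and before the
# optimizer step, so the noise composes with optimizers such as Adam
# or AdaGrad rather than replacing them:
#
#   loss.backward()
#   add_gradient_noise(model, step)
#   optimizer.step()
#   optimizer.zero_grad()
```

Injecting the noise into the gradient (rather than the weights) is what keeps the method compatible with adaptive optimizers, since the perturbed gradient simply passes through their usual update rules.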
