Noisin: Unbiased Regularization for Recurrent Neural Networks

Recurrent neural networks (RNNs) are powerful models of sequential data that have been used successfully in domains such as text and speech. However, RNNs are prone to overfitting, so regularization is important. In this paper we develop Noisin, a new method for regularizing RNNs. Noisin injects random noise into the hidden states of the RNN and then maximizes the corresponding marginal likelihood of the data. We show how Noisin applies to any RNN and study many different types of noise. Noisin is unbiased: on average, it preserves the dynamics of the underlying RNN. We characterize how Noisin regularizes its RNN both theoretically and empirically. On language modeling benchmarks, Noisin improves over dropout by as much as 12.2% on the Penn Treebank and 9.4% on Wikitext-2. We also compare the state-of-the-art language model of Yang et al. (2017) with and without Noisin; on the Penn Treebank, the model trained with Noisin reaches state-of-the-art performance more quickly.
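To make the mechanism concrete, here is a minimal PyTorch sketch of the noise-injection idea. The paper studies many noise distributions; this sketch uses additive zero-mean Gaussian noise, and the class name NoisinRNNCell and the noise_std parameter are illustrative, not taken from the paper's implementation.

```python
# A minimal sketch of Noisin-style unbiased noise injection, assuming a
# simple Elman RNN cell in PyTorch. Names here are hypothetical, not the
# paper's code.
import torch
import torch.nn as nn

class NoisinRNNCell(nn.Module):
    """RNN cell whose hidden state is perturbed by zero-mean noise.

    Because the injected Gaussian noise has mean zero, the noisy hidden
    state equals the deterministic one in expectation, which is the
    unbiasedness property described in the abstract.
    """

    def __init__(self, input_size, hidden_size, noise_std=0.1):
        super().__init__()
        self.cell = nn.RNNCell(input_size, hidden_size)
        self.noise_std = noise_std

    def forward(self, x, h):
        h = self.cell(x, h)  # deterministic transition
        if self.training:    # inject noise only at training time
            h = h + self.noise_std * torch.randn_like(h)  # E[noise] = 0
        return h

# Usage: unroll the cell over a sequence.
cell = NoisinRNNCell(input_size=32, hidden_size=64)
x = torch.randn(8, 10, 32)   # (batch, time, features)
h = torch.zeros(8, 64)
for t in range(x.size(1)):
    h = cell(x[:, t], h)
```

Averaging the training loss over noise draws then gives a Monte Carlo estimate of the marginal likelihood objective that Noisin maximizes.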

[1] Geoffrey Zweig, et al. Context dependent recurrent neural network language model, 2012, IEEE Spoken Language Technology Workshop (SLT).

[2] H. Robbins. A Stochastic Approximation Method, 1951.

[3] Boris Polyak, et al. Acceleration of stochastic approximation by averaging, 1992.

[4] Chris Dyer, et al. On the State of the Art of Evaluation in Neural Language Models, 2017, ICLR.

[5] Guigang Zhang, et al. Deep Learning, 2016, Int. J. Semantic Comput.

[6] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.

[7] Geoffrey E. Hinton, et al. Speech recognition with deep recurrent neural networks, 2013, IEEE International Conference on Acoustics, Speech and Signal Processing.

[8] C. Lee Giles, et al. An analysis of noise in recurrent neural networks: convergence and generalization, 1996, IEEE Trans. Neural Networks.

[9] Richard Socher, et al. Regularizing and Optimizing LSTM Language Models, 2017, ICLR.

[10] Yoshua Bengio, et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, 2014, EMNLP.

[11] Sida I. Wang, et al. Dropout Training as Adaptive Regularization, 2013, NIPS.

[12] Yoshua Bengio, et al. Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations, 2016, ICLR.

[13] Razvan Pascanu, et al. On the difficulty of training recurrent neural networks, 2012, ICML.

[14] Hakan Inan, et al. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling, 2016, ICLR.

[15] T. Poggio, et al. Bagging Regularizes, 2002.

[16] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[17] Yoshua Bengio, et al. Z-Forcing: Training Stochastic Recurrent Networks, 2017, NIPS.

[18] Aaron C. Courville, et al. Recurrent Batch Normalization, 2016, ICLR.

[19] Nicolas Usunier, et al. Improving Neural Language Models with a Continuous Cache, 2016, ICLR.

[20] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks, 2014, NIPS.

[21] Barak A. Pearlmutter. Gradient calculations for dynamic recurrent neural networks: a survey, 1995, IEEE Trans. Neural Networks.

[22] Christopher M. Bishop. Training with Noise is Equivalent to Tikhonov Regularization, 1995, Neural Computation.

[23] Jürgen Schmidhuber, et al. Recurrent Highway Networks, 2016, ICML.

[24] Paul J. Werbos. Generalization of backpropagation with application to a recurrent gas market model, 1988, Neural Networks.

[25] L. Brown. Fundamentals of statistical exponential families: with applications in statistical decision theory, 1986.

[26] Richard Socher, et al. Pointer Sentinel Mixture Models, 2016, ICLR.

[27] Tara N. Sainath, et al. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models, 2018, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[28] Ruslan Salakhutdinov, et al. Breaking the Softmax Bottleneck: A High-Rank RNN Language Model, 2017, ICLR.

[29] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, Computational Linguistics.

[30] Christian Osendorfer, et al. Learning Stochastic Recurrent Networks, 2014, NIPS.

[31] George Kurian, et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016, ArXiv.

[32] Zoubin Ghahramani, et al. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks, 2015, NIPS.

[33] H. Robbins. The Empirical Bayes Approach to Statistical Decision Problems, 1964.

[34] Alex Graves, et al. Generating Sequences With Recurrent Neural Networks, 2013, ArXiv.

[35] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[36] Erhardt Barth, et al. Recurrent Dropout without Memory Loss, 2016, COLING.

[37] Wojciech Zaremba, et al. Recurrent Neural Network Regularization, 2014, ArXiv.

[38] Jeffrey L. Elman, et al. Finding Structure in Time, 1990, Cogn. Sci.

[39] Bohyung Han, et al. Regularizing Deep Neural Networks by Noise: Its Interpretation and Optimization, 2017, NIPS.

[40] Geoffrey E. Hinton, et al. Learning representations by back-propagating errors, 1986, Nature.

[41] Ole Winther, et al. Sequential Neural Models with Stochastic Layers, 2016, NIPS.

[42] Steve Renals, et al. Dynamic Evaluation of Neural Sequence Models, 2017, ICML.

[43] Yann LeCun, et al. Regularization of Neural Networks using DropConnect, 2013, ICML.

[44] Alex Graves, et al. DRAW: A Recurrent Neural Network For Image Generation, 2015, ICML.

[45] Yoshua Bengio, et al. A Recurrent Latent Variable Model for Sequential Data, 2015, NIPS.

[46] Lukáš Burget, et al. Recurrent neural network based language model, 2010, INTERSPEECH.