Regularizing and Optimizing LSTM Language Models

Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs), serve as a fundamental building block for many sequence learning tasks, including machine translation, language modeling, and question answering. In this paper, we consider the specific problem of word-level language modeling and investigate strategies for regularizing and optimizing LSTM-based models. We propose the weight-dropped LSTM, which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization. Further, we introduce NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user. Using these and other regularization strategies, we achieve state-of-the-art word-level perplexities on two data sets: 57.3 on Penn Treebank and 65.8 on WikiText-2. In exploring the effectiveness of a neural cache in conjunction with our proposed model, we achieve even lower state-of-the-art perplexities of 52.8 on Penn Treebank and 52.0 on WikiText-2.
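
To make the two techniques concrete, a minimal PyTorch sketch follows. It is an illustration under stated assumptions, not the authors' released implementation; names such as WeightDropLSTMCell and should_start_averaging are hypothetical. The first piece applies DropConnect to the hidden-to-hidden weight matrix of an LSTM, sampling one mask per forward pass so the same dropped recurrent weights are reused at every timestep; the second piece is the non-monotonic condition that decides when NT-ASGD switches from plain SGD to averaging.

# Minimal sketch of the two ideas above; illustrative only, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDropLSTMCell(nn.Module):
    """LSTM layer with DropConnect on the hidden-to-hidden weight matrix.

    One dropout mask is sampled per forward call and reused at every timestep,
    so the recurrent connections themselves are regularized.
    """
    def __init__(self, input_size, hidden_size, weight_drop=0.5):
        super().__init__()
        self.weight_drop = weight_drop
        self.w_ih = nn.Parameter(torch.randn(4 * hidden_size, input_size) * 0.1)
        self.w_hh = nn.Parameter(torch.randn(4 * hidden_size, hidden_size) * 0.1)
        self.bias = nn.Parameter(torch.zeros(4 * hidden_size))

    def forward(self, inputs, state):
        # inputs: (seq_len, batch, input_size); state: (h, c), each (batch, hidden_size)
        h, c = state
        # DropConnect: zero individual recurrent weights, not activations.
        w_hh = F.dropout(self.w_hh, p=self.weight_drop, training=self.training)
        outputs = []
        for x_t in inputs:
            gates = x_t @ self.w_ih.t() + h @ w_hh.t() + self.bias
            i, f, g, o = gates.chunk(4, dim=-1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
            outputs.append(h)
        return torch.stack(outputs), (h, c)

def should_start_averaging(val_loss, history, n=5):
    """Non-monotonic NT-ASGD trigger: switch from SGD to averaged SGD once the
    current validation loss is worse than the best loss recorded more than n
    validation checks ago."""
    return len(history) > n and val_loss > min(history[:-n])

Once the trigger fires, training would continue with an averaging optimizer (for example, torch.optim.ASGD) so that the weights used at evaluation are the running average of the iterates from the trigger point onward; the paper uses a non-monotone interval of n = 5 validation checks.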
