Dynamic Evaluation of Neural Sequence Models

We present a methodology for using dynamic evaluation to improve neural sequence models. Models are adapted to recent history via a gradient-descent-based mechanism, causing them to assign higher probabilities to recurring sequential patterns. In our comparisons, dynamic evaluation outperforms existing adaptation approaches. It improves the state-of-the-art word-level perplexities on the Penn Treebank and WikiText-2 datasets to 51.1 and 44.3 respectively, and the state-of-the-art character-level cross-entropies on the text8 and Hutter Prize datasets to 1.19 bits/char and 1.08 bits/char respectively.
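To make the adaptation mechanism concrete, the sketch below evaluates a held-out token stream segment by segment and takes a gradient step on each segment immediately after scoring it, so that later segments are scored by a model already adapted to recent history. This is a minimal sketch, not the paper's implementation: the toy model (`TinyLM`), the segment length, the learning rate, and the use of plain SGD in place of the paper's specific update rule are illustrative assumptions.

```python
# Minimal sketch of dynamic evaluation (hypothetical names; plain SGD stands in
# for the paper's update rule, which the abstract does not specify).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """A small LSTM language model used only to illustrate the evaluation loop."""
    def __init__(self, vocab_size=50, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.out(h), state

def dynamic_eval(model, tokens, segment_len=20, lr=1e-4):
    """Score `tokens` segment by segment; after scoring each segment,
    take a gradient step on it so the model adapts to recent history."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    state, total_nll, count = None, 0.0, 0
    for start in range(0, tokens.size(1) - 1, segment_len):
        inp = tokens[:, start:start + segment_len]
        tgt = tokens[:, start + 1:start + segment_len + 1]
        inp = inp[:, :tgt.size(1)]
        logits, state = model(inp, state)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))
        total_nll += loss.item() * tgt.numel()   # accumulate evaluation loss first
        count += tgt.numel()
        optimizer.zero_grad()
        loss.backward()                          # then adapt to the segment just scored
        optimizer.step()
        state = tuple(s.detach() for s in state) # carry hidden state without backprop across segments
    return total_nll / count                     # average negative log-likelihood (nats/token)

# Usage: score a token stream while adapting online.
model = TinyLM()
tokens = torch.randint(0, 50, (1, 200))
print("NLL per token:", dynamic_eval(model, tokens))
```

The key ordering is that each segment contributes to the evaluation loss before the parameters are updated on it, so the reported loss remains a fair measure of prediction on unseen text while the model still benefits from patterns seen earlier in the stream.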
