On the State of the Art of Evaluation in Neural Language Models

Ongoing innovations in recurrent neural network architectures have provided a steady influx of apparently state-of-the-art results on language modelling benchmarks. However, these have been evaluated using differing code bases and limited computational resources, which represent uncontrolled sources of experimental variation. We reevaluate several popular architectures and regularisation methods with large-scale automatic black-box hyperparameter tuning and arrive at the somewhat surprising conclusion that standard LSTM architectures, when properly regularised, outperform more recent models. We establish a new state of the art on the Penn Treebank and Wikitext-2 corpora, as well as strong baselines on the Hutter Prize dataset.
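
To make the evaluation recipe concrete, here is a minimal sketch, written in PyTorch (the paper does not prescribe a framework), of a word-level LSTM language model with two widely used regularisers, dropout and tied input/output embeddings, wrapped in a toy random-search loop that stands in for the large-scale black-box hyperparameter tuning described above. The class and function names (LSTMLanguageModel, evaluate_config), the synthetic data, and all hyperparameter ranges are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: a regularised LSTM language model plus a toy hyperparameter search.
import math
import random
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size, num_layers, dropout):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.drop = nn.Dropout(dropout)
        # Inter-layer dropout only applies when there is more than one LSTM layer.
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers,
                            dropout=dropout if num_layers > 1 else 0.0,
                            batch_first=True)
        self.proj = nn.Linear(hidden_size, emb_size)  # down-project so weights can be tied
        self.decoder = nn.Linear(emb_size, vocab_size)
        self.decoder.weight = self.embed.weight       # tie input and output embeddings

    def forward(self, tokens):
        x = self.drop(self.embed(tokens))
        out, _ = self.lstm(x)
        return self.decoder(self.proj(self.drop(out)))

def evaluate_config(cfg, data, vocab_size, steps=50):
    """Train briefly on the toy batch and return its perplexity as a crude score
    (a real run would train to convergence and score held-out validation data)."""
    model = LSTMLanguageModel(vocab_size, cfg["emb"], cfg["hidden"],
                              cfg["layers"], cfg["dropout"])
    opt = torch.optim.Adam(model.parameters(), lr=cfg["lr"])
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        inputs, targets = data[:, :-1], data[:, 1:]   # predict the next token
        logits = model(inputs)
        loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
        opt.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
        opt.step()
    return math.exp(loss.item())

if __name__ == "__main__":
    vocab_size = 200
    data = torch.randint(0, vocab_size, (8, 36))      # synthetic stand-in corpus
    best = None
    for _ in range(5):                                # random search as a stand-in tuner
        cfg = {"emb": random.choice([64, 128]),
               "hidden": random.choice([128, 256]),
               "layers": random.choice([1, 2]),
               "dropout": random.uniform(0.0, 0.5),
               "lr": 10 ** random.uniform(-3.5, -2.5)}
        score = evaluate_config(cfg, data, vocab_size)
        if best is None or score < best[0]:
            best = (score, cfg)
    print("best perplexity and config:", best)
```

In the study itself, tuning is performed by a dedicated black-box optimisation service rather than random search, and each candidate configuration is trained on the real corpora and scored by validation perplexity; the sketch only mirrors the shape of that loop.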
