Improving Neural Language Models with a Continuous Cache

We propose an extension to neural network language models that adapts their predictions to the recent history. Our model is a simplified version of memory-augmented networks: it stores past hidden activations as memory and accesses them through a dot product with the current hidden activation. This mechanism is very efficient and scales to very large memory sizes. We also draw a link between the use of external memory in neural networks and the cache models used with count-based language models. We demonstrate on several language modeling datasets that our approach performs significantly better than recent memory-augmented networks.
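To make the mechanism concrete, below is a minimal sketch in Python/NumPy of a cache distribution built from stored hidden states and combined with the base model's softmax by linear interpolation. It follows the abstract's description (dot-product access to past activations); the function names and the hyper-parameters `theta` (a flatness coefficient on the scores) and `lam` (the interpolation weight) are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def cache_distribution(h_t, past_h, past_next_words, vocab_size, theta=0.3):
    """Cache distribution over the vocabulary from stored hidden states.

    h_t:             current hidden activation, shape (d,)
    past_h:          stored past hidden activations h_1..h_{t-1}, shape (t-1, d)
    past_next_words: word id that followed each stored state, shape (t-1,)
    theta:           score flatness hyper-parameter (assumed name)
    """
    scores = past_h @ h_t                 # dot product with the current hidden state
    weights = np.exp(theta * scores)      # unnormalized weight of each memory slot
    p_cache = np.zeros(vocab_size)
    # Each slot votes for the word that followed it in the history.
    np.add.at(p_cache, past_next_words, weights)
    return p_cache / p_cache.sum()

def combine(p_vocab, p_cache, lam=0.1):
    # One simple way to merge the two distributions: linear interpolation.
    return (1.0 - lam) * p_vocab + lam * p_cache
```

Note that memory access here is a single matrix-vector product over the stored activations, with no parameters learned for the cache itself beyond the two scalar hyper-parameters, which is what lets the mechanism scale to very large memories.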
