We investigate the effective memory depth of RNN models by using them for $n$-gram language model (LM) smoothing.
Experiments on a small corpus (UPenn Treebank: one million words of training data and a 10k vocabulary) found the LSTM cell with dropout to be the best model for encoding the $n$-gram state, when compared with feed-forward and vanilla RNN models. When preserving the sentence independence assumption, the LSTM $n$-gram matches the LSTM LM performance for $n=9$ and slightly outperforms it for $n=13$. When allowing dependencies across sentence boundaries, the LSTM $13$-gram almost matches the perplexity of the unlimited-history LSTM LM.
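To make the setup concrete, a minimal sketch of an LSTM $n$-gram LM follows; the layer sizes, dropout rate, and class/argument names are illustrative assumptions, not the paper's configuration. The key point is that the recurrence sees only the $(n-1)$ context words and a fresh zero state per context.

```python
import torch
import torch.nn as nn

# Minimal sketch (our illustration, not the paper's implementation) of an LSTM
# n-gram LM: the recurrence runs over exactly the (n-1) context words, starting
# from a zero state for every context, so no history beyond n-1 words is used.
class LSTMNGram(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=200, hidden_dim=200, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.drop = nn.Dropout(dropout)            # dropout, as in the best PTB setup
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, contexts):                   # contexts: [batch, n-1] word ids
        h, _ = self.lstm(self.drop(self.embed(contexts)))  # initial state defaults to zeros
        return self.out(self.drop(h[:, -1]))               # next-word logits

logits = LSTMNGram()(torch.randint(0, 10000, (32, 12)))    # a batch of 13-gram contexts
print(logits.shape)                                         # torch.Size([32, 10000])
```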
LSTM $n$-gram smoothing also has the desirable property of improving with increasing $n$-gram order, unlike the Katz or Kneser-Ney back-off estimators. Using multinomial distributions as training targets instead of the usual one-hot targets is only slightly beneficial at low $n$-gram orders.
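The difference between the two target types can be illustrated with a short sketch; the toy vocabulary, counts, and variable names below are hypothetical, not taken from the paper.

```python
import numpy as np

# Toy illustration (not the paper's code): a one-hot target puts all probability
# mass on the single observed next word, whereas a multinomial target spreads it
# over the words seen after this context in the training data (relative frequencies).
rng = np.random.default_rng(0)
vocab_size = 10
logits = rng.normal(size=vocab_size)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                   # model distribution p(w | context)

one_hot = np.zeros(vocab_size)
one_hot[3] = 1.0                                       # the single observed next word
multinomial = np.zeros(vocab_size)
multinomial[[3, 4, 5]] = [0.6, 0.3, 0.1]               # hypothetical counts 6/3/1, normalized

ce_one_hot = -np.sum(one_hot * np.log(probs))          # standard cross-entropy loss
ce_multinomial = -np.sum(multinomial * np.log(probs))  # cross-entropy to the soft target
print(ce_one_hot, ce_multinomial)
```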
Experiments on the One Billion Words benchmark show that the results hold at larger scale: while LSTM smoothing for short $n$-gram contexts does not provide significant advantages over classic $n$-gram models, it becomes effective with long contexts ($n > 5$); depending on the task and the amount of data, it can match fully recurrent LSTM models at about $n=13$. This may have implications for modeling short-format text, e.g., voice search/query LMs.
Building LSTM $n$-gram LMs may be appealing in some practical situations: the state of an $n$-gram LM can be represented succinctly with $(n-1) \times 4$ bytes storing the identities of the words in the context, and batches of $n$-gram contexts can be processed in parallel. On the downside, the $n$-gram context encoding computed by the LSTM is discarded, making the model computationally more expensive than a regular recurrent LSTM LM.
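As an illustration of the compact state representation (a minimal sketch; the helper names and array layout are ours): each context is just $(n-1)$ 32-bit word ids, and fixed-length contexts stack naturally into batches for parallel scoring.

```python
import numpy as np

N = 13                       # n-gram order (the 13-gram setup from the experiments)
CONTEXT_LEN = N - 1

def pack_context(word_ids):
    """Store an (n-1)-word context as (n-1) * 4 bytes of 32-bit word ids."""
    assert len(word_ids) == CONTEXT_LEN
    return np.asarray(word_ids, dtype=np.uint32).tobytes()   # 48 bytes for n = 13

def unpack_context(blob):
    return np.frombuffer(blob, dtype=np.uint32)

# Because every context has the same fixed length and carries no hidden LSTM state,
# a batch of contexts stacks into a single [batch, n-1] matrix for parallel scoring.
batch = np.stack([unpack_context(pack_context(range(i, i + CONTEXT_LEN)))
                  for i in range(4)])
print(batch.shape)           # (4, 12)
```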