Sparse Non-Negative Matrix Language Modeling: Maximum Entropy Flexibility on the Cheap

We present a new method for estimating the sparse non-negative matrix (SNM) language model using a small amount of held-out data and the multinomial loss that is natural for language modeling; we validate it experimentally against the previous estimation method, which uses leave-one-out on training data and a binary loss function, and show that it performs equally well. The ability to train on held-out data is important in practical situations where the training data is mismatched with the held-out/test data. We find that fairly small amounts of held-out data (on the order of 30-70 thousand words) are sufficient for training the adjustment model, which is the only model component estimated by gradient descent; the bulk of the model parameters are relative frequencies counted on training data. A second contribution is a comparison between SNM and the related class of Maximum Entropy language models. We show that, while much cheaper computationally, SNM achieves slightly better perplexity for the same feature set, and the same speech recognition accuracy on voice search and short message dictation.

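To make the estimation setup concrete, the following is a minimal sketch in Python/NumPy under strong simplifying assumptions: relative frequencies are counted on a toy training text, the adjustment model is reduced to a single scalar weight per context feature (the actual model drives the adjustment function with meta-features and feature hashing), and those weights are fit by gradient descent on the multinomial loss over a small held-out text. All names (the toy corpora, `features`, `theta`) are illustrative and not taken from the paper.

```python
"""Sketch of SNM-style estimation: relative frequencies from training data,
plus a per-feature adjustment trained with multinomial loss on held-out data.
This is a simplified illustration, not the paper's implementation."""
from collections import Counter, defaultdict
import numpy as np

# Toy data: training text for counting, held-out text for the adjustment model.
train = "the cat sat on the mat the dog sat on the rug".split()
heldout = "the cat sat on the rug".split()
vocab = sorted(set(train))
V = len(vocab)
w2i = {w: i for i, w in enumerate(vocab)}

def features(history):
    """Context features: an always-on (unigram) feature plus the previous word."""
    return ["<empty>"] + (["prev=" + history[-1]] if history else [])

# Bulk of the parameters: relative frequencies counted on training data.
cooc = defaultdict(Counter)            # feature -> counts of next words
for i, w in enumerate(train):
    for f in features(train[:i]):
        cooc[f][w] += 1
feat_list = sorted(cooc)
f2i = {f: i for i, f in enumerate(feat_list)}
rel = np.zeros((len(feat_list), V))
for f, counts in cooc.items():
    total = sum(counts.values())
    for w, c in counts.items():
        rel[f2i[f], w2i[w]] = c / total

# Adjustment model (assumption: one log-scale weight per feature), trained by
# gradient descent on the multinomial loss over the held-out data.
theta = np.zeros(len(feat_list))
lr = 0.2
for epoch in range(100):
    for i, w in enumerate(heldout):
        active = [f2i[f] for f in features(heldout[:i]) if f in f2i]
        if not active:
            continue
        scale = np.exp(theta[active])              # exp(adjustment) per feature
        y = scale @ rel[active]                    # unnormalized scores over vocab
        Z = y.sum()
        t = w2i[w]
        # Gradient of -log P(w|h) with respect to theta for the active features.
        grad = scale * (rel[active].sum(axis=1) / Z
                        - rel[active, t] / max(y[t], 1e-12))
        theta[active] -= lr * grad

# Held-out perplexity of the adjusted model.
logp, n = 0.0, 0
for i, w in enumerate(heldout):
    active = [f2i[f] for f in features(heldout[:i]) if f in f2i]
    y = np.exp(theta[active]) @ rel[active]
    logp += np.log(max(y[w2i[w]] / y.sum(), 1e-12))
    n += 1
print("held-out perplexity:", np.exp(-logp / n))
```

The gradient follows directly from the multinomial loss: with y(w, h) a non-negative linear combination of relative frequencies weighted by exp(theta), the loss -log(y(w, h) / Z) differentiates into the per-feature expression used above, so plain (or AdaGrad-style) gradient descent over the held-out stream is all that is needed for the adjustment parameters.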