Skip-gram Language Modeling Using Sparse Non-negative Matrix Probability Estimation

We present a novel family of language model (LM) estimation techniques named Sparse Non-negative Matrix (SNM) estimation. A first set of experiments empirically evaluating it on the One Billion Word Benchmark [Chelba et al., 2013] shows that SNM n-gram LMs perform almost as well as the well-established Kneser-Ney (KN) models. When using skip-gram features the models are able to match the state-of-the-art recurrent neural network (RNN) LMs; combining the two modeling techniques yields the best known result on the benchmark. The computational advantages of SNM over both maximum entropy and RNN LM estimation are probably its main strength, promising an approach that has the same flexibility in combining arbitrary features effectively and yet should scale to very large amounts of data as gracefully as n-gram LMs do.
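
To make the modeling idea concrete, the following is a minimal sketch in Python of the general approach, not the paper's implementation: a context is mapped to sparse features (n-grams and simple skip-grams), each feature indexes a row of a sparse non-negative matrix M, and P(w | h) is obtained by summing the entries M[f][w] over the features f firing in h and normalizing. The names skipgram_features and SparseNonNegativeLM are illustrative, and the sketch fills M with plain relative-frequency estimates rather than the full SNM estimation procedure.

# Minimal sketch of SNM-style probability estimation with skip-gram features.
# Helper and class names are hypothetical, not from the paper's code; M is filled
# with raw relative frequencies instead of the paper's SNM estimation procedure.
from collections import defaultdict


def skipgram_features(context, max_ngram=3, max_skip=2):
    """Extract n-gram and skip-gram features from a context (list of words)."""
    feats = []
    # Plain n-gram features over the most recent words in the context.
    for n in range(1, max_ngram + 1):
        if len(context) >= n:
            feats.append(('ngram', tuple(context[-n:])))
    # Skip-gram features: one remote word, a gap of `skip` words, then the
    # word immediately preceding the predicted position.
    for skip in range(1, max_skip + 1):
        idx = len(context) - 2 - skip  # position of the remote word
        if idx >= 0:
            feats.append(('skip', context[idx], skip, context[-1]))
    return feats


class SparseNonNegativeLM:
    """Toy SNM-style LM: a sparse non-negative matrix M[feature][word]."""

    def __init__(self):
        self.M = defaultdict(lambda: defaultdict(float))  # feature -> word -> weight >= 0
        self.vocab = set()

    def train_counts(self, sentences):
        """Fill M with relative-frequency estimates of P(word | feature)."""
        feat_totals = defaultdict(float)
        for sent in sentences:
            for i, word in enumerate(sent):
                self.vocab.add(word)
                for f in skipgram_features(sent[:i]):
                    self.M[f][word] += 1.0
                    feat_totals[f] += 1.0
        # Normalize each feature row so M[f][w] is a relative frequency.
        for f, row in self.M.items():
            for w in row:
                row[w] /= feat_totals[f]

    def prob(self, context, word):
        """P(word | context): sum M[f][word] over firing features f, normalized."""
        feats = skipgram_features(context)
        numerator = sum(self.M.get(f, {}).get(word, 0.0) for f in feats)
        denominator = sum(sum(self.M.get(f, {}).values()) for f in feats)
        if denominator == 0.0:  # no known feature fired: back off to uniform
            return 1.0 / max(len(self.vocab), 1)
        return numerator / denominator


if __name__ == "__main__":
    corpus = [["the", "quick", "brown", "fox", "jumps"],
              ["the", "quick", "brown", "dog", "sleeps"]]
    lm = SparseNonNegativeLM()
    lm.train_counts(corpus)
    print(lm.prob(["the", "quick", "brown"], "fox"))  # 0.5 on this toy corpus

A real implementation would need a much more careful estimate of M (the paper's SNM probability estimation rather than raw relative frequencies) and a hashed sparse storage scheme to scale to large corpora; the sketch only shows how skip-gram features and the non-negative matrix interact at prediction time.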

[1] Frederick Jelinek, et al. Interpolated estimation of Markov source parameters from sparse data, 1980.

[2] Slava M. Katz, et al. Estimation of probabilities from sparse data for the language model component of a speech recognizer, 1987, IEEE Trans. Acoust. Speech Signal Process.

[3] Robert L. Mercer, et al. Class-Based n-gram Models of Natural Language, 1992, CL.

[4] Ronald Rosenfeld, et al. Adaptive Statistical Language Modeling: A Maximum Entropy Approach, 1994.

[5] Hermann Ney, et al. Improved backing-off for M-gram language modeling, 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[6] Frederick Jelinek, et al. Statistical methods for speech recognition, 1997.

[7] Yoshua Bengio, et al. A Neural Probabilistic Language Model, 2003, J. Mach. Learn. Res.

[8] Frederick Jelinek, et al. Structured language modeling, 2000, Comput. Speech Lang.

[9] J. R. Bellegarda, et al. Exploiting latent semantic information in statistical language modeling, 2000, Proceedings of the IEEE.

[10] Joshua Goodman, et al. A bit of progress in language modeling, 2001, Comput. Speech Lang.

[11] Joshua Goodman, et al. Classes for fast maximum entropy training, 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[12] Yoshua Bengio, et al. Hierarchical Probabilistic Neural Network Language Model, 2005, AISTATS.

[13] Ahmad Emami, et al. A Neural Syntactic Language Model, 2005, Machine Learning.

[14] Yee Whye Teh, et al. A Hierarchical Bayesian Language Model Based On Pitman-Yor Processes, 2006, ACL.

[15] Holger Schwenk, et al. Continuous space language models, 2007, Comput. Speech Lang.

[16] Sanjeev Khudanpur, et al. Efficient Subsampling for Training Complex Language Models, 2011, EMNLP.

[17] Lei Zheng, et al. A Scalable Distributed Syntactic, Semantic, and Lexical Language Model, 2012, CL.

[18] Tomáš Mikolov. Statistical Language Models Based on Neural Networks, Ph.D. thesis, Brno University of Technology, 2012.

[19] Hermann Ney, et al. LSTM Neural Networks for Language Modeling, 2012, INTERSPEECH.

[20] Dietrich Klakow, et al. Comparing RNNs and log-linear interpolation of improved skip-model on four Babel languages: Cantonese, Pashto, Tagalog, Turkish, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[21] Thorsten Brants, et al. One billion word benchmark for measuring progress in statistical language modeling, 2013, INTERSPEECH.

[22] Tony Robinson, et al. Scaling recurrent neural network language models, 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).