Speed regularization and optimality in word classing

Word classing has been used in language modeling for two distinct purposes: to improve the likelihood of the language model, and to improve its runtime speed. In particular, frequency-based heuristics have been proposed to improve the speed of recurrent neural network language models (RNN-LMs). In this paper, we present a dynamic programming algorithm for determining classes in a way that provably minimizes the runtime of the resulting class-based language models. However, we also find that the speed-based methods degrade perplexity by 5-10% relative to traditional likelihood-based classing. We remedy this by introducing a speed-based regularization term into the likelihood objective function, which achieves a runtime close to that of the speed-based methods with no loss in perplexity. We demonstrate these improvements with both an RNN-LM and the Model M exponential language model, on three different tasks involving two different languages.
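
The abstract does not spell out the cost model behind the speed-optimal classing, so the following is only a minimal sketch under common assumptions, not the paper's exact algorithm. It assumes a two-step class-based softmax in which scoring a word costs roughly (number of classes) + (size of the word's class), weighted by unigram frequency, and that an optimal assignment can be restricted to contiguous blocks of the frequency-sorted vocabulary; a simple dynamic program then chooses the class boundaries and the number of classes that minimize the expected cost. All names (optimal_contiguous_classes, counts, max_classes) are illustrative.

```python
def optimal_contiguous_classes(counts, max_classes):
    """Sketch: partition a frequency-sorted vocabulary into contiguous classes
    so as to minimize the expected cost of a two-step class-based softmax,
    assuming cost(w) = (#classes) + |class(w)|, weighted by unigram frequency.

    counts: unigram counts sorted in descending order.
    """
    V = len(counts)
    total = float(sum(counts))
    # prefix[i] = total count of the i most frequent words
    prefix = [0.0] * (V + 1)
    for i, c in enumerate(counts):
        prefix[i + 1] = prefix[i] + c

    INF = float("inf")
    # dp[j][i] = min over partitions of the first i words into j classes of
    #            sum over classes of (class frequency mass * class size)
    dp = [[INF] * (V + 1) for _ in range(max_classes + 1)]
    back = [[0] * (V + 1) for _ in range(max_classes + 1)]
    dp[0][0] = 0.0
    for j in range(1, max_classes + 1):
        for i in range(j, V + 1):
            for b in range(j - 1, i):          # previous class boundary
                mass = prefix[i] - prefix[b]   # frequency mass of this class
                size = i - b                   # number of words in this class
                cand = dp[j - 1][b] + mass * size
                if cand < dp[j][i]:
                    dp[j][i] = cand
                    back[j][i] = b

    # Expected lookup cost with j classes: j + frequency-weighted avg class size.
    best_j = min(range(1, max_classes + 1),
                 key=lambda j: j + dp[j][V] / total)

    # Recover the class boundaries (end indices) of the best partition.
    bounds, i = [], V
    for j in range(best_j, 0, -1):
        bounds.append(i)
        i = back[j][i]
    return best_j, sorted(bounds)


# Toy example: a Zipf-like vocabulary of 300 word types.
counts = [300 // r for r in range(1, 301)]
k, boundaries = optimal_contiguous_classes(counts, max_classes=30)
print(k, boundaries[:5])
```

As expected under this cost model, the resulting classes are small for frequent words and large for rare ones, which is the qualitative behavior frequency-based speed heuristics aim for.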
