Evaluation Metrics For Language Models

The most widely used evaluation metric for language models in speech recognition is the perplexity of test data. While perplexity can be calculated efficiently and without access to a speech recognizer, it often does not correlate well with speech recognition word-error rate. In this research, we attempt to find a measure that, like perplexity, is easily calculated, but which better predicts speech recognition performance. We investigate two approaches. First, we attempt to extend perplexity with similar measures that use information about language models that perplexity ignores. Second, we attempt to imitate the word-error calculation without using a speech recognizer by artificially generating speech recognition lattices. To test our new metrics, we built over thirty varied language models. We find that perplexity correlates with word-error rate remarkably well when only n-gram models trained on in-domain data are considered. When other types of models are considered, our novel metrics are superior to perplexity for predicting speech recognition performance. However, we conclude that none of these measures predicts word-error rate accurately enough to be an effective tool for language model evaluation in speech recognition.
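As a point of reference for the metric discussed above, the standard per-word perplexity of test data can be computed directly from a model's per-word probabilities. The sketch below assumes the model's probabilities are already available as a list; the function names are illustrative, not from the paper.

```python
import math

def perplexity(word_log_probs):
    """Per-word perplexity from natural-log word probabilities.

    PP = exp( -(1/N) * sum_i log p(w_i | history_i) )
    """
    n = len(word_log_probs)
    avg_log_prob = sum(word_log_probs) / n
    return math.exp(-avg_log_prob)

# Toy example: a model that assigns probability 0.25 to every word
# in an 8-word test sequence yields a perplexity of exactly 4.0.
log_probs = [math.log(0.25)] * 8
print(perplexity(log_probs))  # -> 4.0
```

Lower perplexity indicates that the model finds the test data less "surprising"; the abstract's point is that this quantity, while cheap to compute, is only a loose proxy for word-error rate.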
