On-Line Language Model Biasing for Multi-Pass Automatic Speech Recognition

The language model (LM) is a critical component in statistical automatic speech recognition (ASR) systems, serving to establish a probability distribution over the hypothesis space. In typical use, the LM is trained off-line and remains static at run-time. While cache LMs, dialogue/style adaptation, and information retrieval-based biasing offer some ability to modify the LM at run-time, they are limited in scope, susceptible to recognition error, place restrictions on the training data and/or test sets, or cannot be implemented in on-line, interactive systems. In this paper, we describe a novel LM biasing method suitable for multi-pass ASR systems. We use k-best lists from the initial recognition pass to obtain a confidence-weighted biasing of the LM training corpus; these weights are then used to train an LM biased toward the test input. The biased LM is applied in the second pass to obtain refined hypotheses, either by re-decoding or by re-ranking the k-best list. We sketch an on-line implementation of this scheme that lends itself to integration within low-latency systems. The proposed method is robust to recognition error and operates on individual utterances without the need for dialogue context. The biased LMs provide significant reductions in perplexity and consistent improvements in word error rate (WER) over unbiased, state-of-the-art, large-vocabulary baseline ASR systems. On the Farsi and English test sets, we obtained relative reductions in perplexity of 24.5% and 31.6%, respectively. Additionally, relative reductions of 1.6% and 1.8% in WER were obtained for large-vocabulary Farsi and English ASR, respectively.
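The bias-then-rescore loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the confidence-weighted word-overlap weighting, the smoothed unigram LM, and the interpolation weight `lam` are all simplifying assumptions standing in for the actual corpus-biasing and LM-training machinery.

```python
from collections import Counter
import math

def bias_weights(kbest, corpus):
    """Weight each training sentence by its confidence-weighted word
    overlap with the first-pass k-best hypotheses (a simple stand-in
    for the paper's confidence-weighted corpus biasing).

    kbest:  list of (hypothesis_tokens, confidence) pairs
    corpus: list of token lists (the LM training sentences)
    """
    query = Counter()
    for hyp, conf in kbest:
        for w in hyp:
            query[w] += conf
    weights = []
    for sent in corpus:
        overlap = sum(query[w] for w in set(sent))
        weights.append(overlap / (len(sent) + 1))  # length-normalized
    return weights

def train_unigram(corpus, weights, vocab):
    """Weighted maximum-likelihood unigram LM with add-one smoothing
    (a toy substitute for a full n-gram LM trained on the biased corpus)."""
    counts, total = Counter(), 0.0
    for sent, w in zip(corpus, weights):
        for tok in sent:
            counts[tok] += w
            total += w
    V = len(vocab)
    return {tok: (counts[tok] + 1.0) / (total + V) for tok in vocab}

def rescore(kbest, lm, lam=0.5):
    """Second-pass re-ranking: interpolate the first-pass confidence
    (assumed > 0) with the biased-LM log-probability of each hypothesis."""
    def score(hyp, conf):
        lp = sum(math.log(lm[w]) for w in hyp)
        return lam * math.log(conf) + (1.0 - lam) * lp
    return max(kbest, key=lambda hc: score(*hc))[0]
```

In an on-line setting, only `bias_weights` and `train_unigram` need to run between passes, which is what makes a low-latency multi-pass implementation plausible; a real system would bias a full n-gram LM rather than a unigram model.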
