Efficient One-Pass Decoding with NNLM for Speech Recognition

Neural network language models (NNLMs) have achieved strong results in speech recognition, machine translation, and related fields. Decoding directly with an NNLM, however, is challenging because of its overwhelming computational cost, so most previous work has focused on second-pass rescoring of N-best lists and lattices. In this work, several techniques are explored to incorporate the NNLM directly into the speech recognition decoder. A novel training algorithm based on variance regularization is proposed, which allows the softmax normalization factor to be approximated by a constant for fast evaluation. The evaluation of the NNLM is further sped up through an optimized storage scheme, and a simple cache-based strategy is explored to avoid redundant computations during decoding. To the authors' knowledge, this is the first work to incorporate an NNLM directly into one-pass decoding. The proposed methods are evaluated on an English Switchboard phone-call speech-to-text task. Experimental results show that incorporating the NNLM into the decoder reduces the word error rate (WER) by 1.5% and 1.4% absolute on the Hub5'00-SWB and RT03S-FSH test sets, respectively, and that decoding with the NNLM is twice as fast as the baseline at the same WER.
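To make the variance-regularization idea concrete, here is a minimal sketch in PyTorch of a training loss that penalizes the variance of the log normalizer log Z(h) = logsumexp over the output logits, so that Z(h) becomes nearly constant and decoding can skip the softmax. The penalty weight `gamma`, the batch-level variance estimate, and the held-out constant `log_z_const` are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def variance_regularized_loss(logits, targets, gamma=0.1):
    """Cross-entropy plus a penalty on the variance of the log
    normalizer log Z(h) = logsumexp(logits), encouraging Z(h) to be
    approximately constant across histories.

    logits:  (batch, vocab) unnormalized scores s(w | h)
    targets: (batch,) index of the observed next word
    gamma:   penalty weight (illustrative value, not from the paper)
    """
    ce = F.cross_entropy(logits, targets)       # standard NNLM objective
    log_z = torch.logsumexp(logits, dim=1)      # log Z(h) for each example
    var_penalty = torch.var(log_z)              # Var[log Z(h)] over the batch
    return ce + gamma * var_penalty

def fast_log_prob(logits, word_id, log_z_const):
    """At decode time, approximate log P(w | h) by s(w | h) - log_z_const,
    where log_z_const is the (now nearly constant) log normalizer
    estimated once, e.g. on held-out data. Only the single requested
    output logit needs to be computed, not the full softmax.

    logits: (vocab,) unnormalized scores for one history h
    """
    return logits[word_id] - log_z_const
```

With this objective, the expensive softmax over the full vocabulary is paid only at training time; the decoder evaluates one output unit per query.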
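The cache-based strategy amounts to memoizing NNLM scores, since many partial hypotheses in a one-pass decoder share the same truncated history. The sketch below is a hypothetical illustration: the `(history, word)` cache key, the size bound, and the crude eviction policy are assumptions, not details from the paper.

```python
class CachedNNLM:
    """Wraps an NNLM so that repeated (history, word) queries during
    one-pass decoding hit a cache instead of re-running the network."""

    def __init__(self, nnlm, max_entries=1_000_000):
        self.nnlm = nnlm            # assumed to expose log_prob(history, word)
        self.cache = {}
        self.max_entries = max_entries

    def score(self, history, word):
        key = (history, word)       # history: tuple of word ids (hashable)
        if key not in self.cache:
            if len(self.cache) >= self.max_entries:
                # Crude eviction for the sketch; a real decoder might use
                # an LRU policy or reset the cache per utterance.
                self.cache.clear()
            self.cache[key] = self.nnlm.log_prob(history, word)
        return self.cache[key]
```

Because a feed-forward NNLM conditions only on a fixed-length history, distinct decoder hypotheses that end in the same few words map to the same cache entry, which is what makes this simple memoization effective.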
