Paper information - Real-Time One-Pass Decoder for Speech Recognition Using LSTM Language Models

Recurrent Neural Networks, in particular Long Short-Term Memory (LSTM) networks, are widely used in Automatic Speech Recognition for language modelling during decoding, usually as a mechanism for rescoring hypotheses. This paper proposes a new architecture to perform real-time one-pass decoding using LSTM language models. To make decoding efficient, the estimation of look-ahead scores was accelerated by precomputing static look-ahead tables. These static tables were precomputed from a pruned n-gram model, drastically reducing the computational cost during decoding. Additionally, the LSTM language model evaluation was efficiently performed using Variance Regularization along with a strategy of lazy evaluation. The proposed one-pass decoder architecture was evaluated on the well-known LibriSpeech and TED-LIUMv3 datasets. Results showed that the proposed algorithm obtains very competitive word error rates (WERs) at real-time factors (RTFs) of about 0.6. Finally, our one-pass decoder is compared with a decoupled two-pass decoder.
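To make the idea of a static look-ahead table concrete, here is a minimal sketch, not the paper's implementation. It assumes a toy lexicon and a history-free (unigram) model standing in for the pruned n-gram model; the function name `build_lookahead_table`, the phone symbols, and the probabilities are all hypothetical. For each pronunciation prefix (a node in a lexical prefix tree), the table stores the best language model log-probability of any word still reachable from that prefix, so the decoder can look the value up instead of computing it during search.

```python
import math
from collections import defaultdict


def build_lookahead_table(lexicon, unigram_logprob):
    """Map each pronunciation prefix to the best LM log-prob of any word below it.

    Assumption: a history-free (unigram) model, so one table suffices.
    A real n-gram model would need one table per retained history.
    """
    table = defaultdict(lambda: -math.inf)
    for word, phones in lexicon.items():
        score = unigram_logprob[word]
        # Every prefix of the pronunciation (including the empty root prefix)
        # is a tree node from which this word can still be reached.
        for i in range(len(phones) + 1):
            prefix = tuple(phones[:i])
            if score > table[prefix]:
                table[prefix] = score
    return dict(table)


# Hypothetical toy lexicon and unigram probabilities, for illustration only.
lexicon = {
    "speech": ["s", "p", "iy", "ch"],
    "speed": ["s", "p", "iy", "d"],
    "spin": ["s", "p", "ih", "n"],
}
unigram_logprob = {
    "speech": math.log(0.5),
    "speed": math.log(0.3),
    "spin": math.log(0.2),
}

table = build_lookahead_table(lexicon, unigram_logprob)

# During search, a partial hypothesis at a given prefix adds this value
# as an optimistic estimate of the LM score of the word being decoded.
best_after_s_p_ih = table[("s", "p", "ih")]
```

Because the table is built once before decoding, each look-ahead query during search is a dictionary lookup; the trade-off is that the estimate comes from the smaller model rather than from the LSTM language model itself.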
