Efficient Training and Evaluation of Recurrent Neural Network Language Models for Automatic Speech Recognition

Recurrent neural network language models (RNNLMs) are becoming increasingly popular for a range of applications, including automatic speech recognition. An important issue that limits their possible application areas is the computational cost incurred in training and evaluation. This paper describes a series of new efficiency-improving approaches that allow RNNLMs to be trained more efficiently on graphics processing units (GPUs) and evaluated more efficiently on CPUs. First, a modified RNNLM architecture with a non-class-based, full output layer structure (F-RNNLM) is proposed. This modified architecture facilitates a novel spliced sentence bunch mode parallelization of F-RNNLM training on a GPU using large quantities of data. Second, two efficient RNNLM training criteria, based on variance regularization and noise contrastive estimation, are explored to specifically reduce the computation associated with the softmax normalisation term at the RNNLM output layer. Finally, a pipelined training algorithm utilizing multiple GPUs is used to further improve the training speed. Initially, RNNLMs were trained on a moderately sized dataset of 20M words from a large-vocabulary conversational telephone speech recognition task. RNNLM training time on a single GPU was reduced by up to a factor of 53 over the standard CPU-based RNNLM toolkit, and a 56-fold speed-up in test-time evaluation on a CPU was obtained over the baseline F-RNNLMs. Consistent improvements in both recognition accuracy and perplexity were also obtained over conventional class-based RNNLMs (C-RNNLMs). Experiments on Google's One Billion Word benchmark corpus also show that RNNLM training scales well to much larger quantities of data.
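To make the variance regularization idea concrete, the sketch below shows a minimal NumPy version of a cross-entropy criterion augmented with a penalty on the variance of the softmax log-normaliser. This is an illustrative reconstruction, not the paper's implementation: the function name vr_loss and the default values of gamma and log_z_bar are assumptions for the example.

```python
import numpy as np

def vr_loss(logits, targets, gamma=0.1, log_z_bar=0.0):
    """Cross-entropy with variance regularisation of the softmax
    log-normaliser (illustrative sketch; gamma and log_z_bar are
    assumed hyper-parameters, not values from the paper).

    logits : (batch, vocab) unnormalised output-layer scores
    targets: (batch,)       indices of the correct next words
    """
    # Numerically stable log of the softmax normaliser per token.
    m = logits.max(axis=1, keepdims=True)
    log_z = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))

    # Standard cross-entropy on the normalised log-probabilities.
    log_probs = logits[np.arange(len(targets)), targets] - log_z
    ce = -log_probs.mean()

    # Penalise deviation of log Z from a constant, so that at test
    # time the normaliser can be treated as fixed and skipped, making
    # output-layer evaluation cost independent of the vocabulary size.
    penalty = 0.5 * gamma * np.mean((log_z - log_z_bar) ** 2)
    return ce + penalty
```

The design intent is that after training, the raw score of a single word can be used directly as its (approximately normalised) log-probability, avoiding the full-vocabulary summation that dominates F-RNNLM evaluation cost.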
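The noise contrastive estimation criterion avoids computing the normaliser during training as well, by recasting density estimation as classifying data words against samples from a noise distribution. The following is a minimal sketch under the usual NCE formulation with the normaliser held constant; the argument names and batch layout are assumptions for illustration, not the CUED-RNNLM interface.

```python
import numpy as np

def nce_loss(data_scores, data_log_q, noise_scores, noise_log_q, k):
    """Noise contrastive estimation loss for one batch (illustrative
    sketch; the softmax normaliser is treated as a constant of 1, i.e.
    log Z = 0, so it never needs to be computed).

    data_scores : (batch,)   unnormalised log scores s(w|h) of true words
    data_log_q  : (batch,)   log q(w) of true words under the noise dist.
    noise_scores: (batch, k) unnormalised log scores of k noise samples
    noise_log_q : (batch, k) log q of the noise samples
    """
    # Logit of the posterior that a word came from the data rather than
    # the noise distribution: sigma(s(w|h) - log(k * q(w))).
    d_data = data_scores - (np.log(k) + data_log_q)
    d_noise = noise_scores - (np.log(k) + noise_log_q)

    # -log sigma(x) = logaddexp(0, -x); -log(1 - sigma(x)) = logaddexp(0, x).
    loss = np.logaddexp(0.0, -d_data) + np.logaddexp(0.0, d_noise).sum(axis=1)
    return loss.mean()
```

Because only the true word and k noise samples are scored per training token, the per-token training cost at the output layer is independent of the vocabulary size, which is the source of the training speed-ups reported in the abstract.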
