Efficient Training and Evaluation of Recurrent Neural Network Language Models for Automatic Speech Recognition

Recurrent neural network language models (RNNLMs) are becoming increasingly popular for a range of applications, including automatic speech recognition. An important issue that limits their possible application areas is the computational cost incurred in training and evaluation. This paper describes a series of new efficiency-improving approaches that allow RNNLMs to be trained more efficiently on graphics processing units (GPUs) and evaluated more efficiently on CPUs. First, a modified RNNLM architecture with a non-class-based, full output layer structure (F-RNNLM) is proposed. This modified architecture facilitates a novel spliced sentence bunch mode parallelization of F-RNNLM training on a GPU using large quantities of data. Second, two efficient RNNLM training criteria, based on variance regularization and noise contrastive estimation, are explored to specifically reduce the computation associated with the softmax normalisation term at the RNNLM output layer. Finally, a pipelined training algorithm utilizing multiple GPUs is used to further improve the training speed. Initially, RNNLMs were trained on a moderately sized dataset of 20M words from a large-vocabulary conversational telephone speech recognition task. RNNLM training time on a single GPU was reduced by up to a factor of 53 over the standard CPU-based RNNLM toolkit, and a 56-fold speed-up in test-time evaluation on a CPU was obtained over the baseline F-RNNLMs. Consistent improvements in both recognition accuracy and perplexity were also obtained over conventional class-based RNNLMs (C-RNNLMs). Experiments on Google's One Billion Word benchmark corpus also show that RNNLM training scales well to much larger quantities of data.
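To make the variance regularization idea concrete, the sketch below shows a minimal NumPy version of a cross-entropy criterion augmented with a penalty on the variance of the softmax log-normaliser. This is an illustrative reconstruction, not the paper's implementation: the function name vr_loss and the default values of gamma and log_z_bar are assumptions for the example.

```python
import numpy as np

def vr_loss(logits, targets, gamma=0.1, log_z_bar=0.0):
    """Cross-entropy with variance regularisation of the softmax
    log-normaliser (illustrative sketch; gamma and log_z_bar are
    assumed hyper-parameters, not values from the paper).

    logits : (batch, vocab) unnormalised output-layer scores
    targets: (batch,)       indices of the correct next words
    """
    # Numerically stable log of the softmax normaliser per token.
    m = logits.max(axis=1, keepdims=True)
    log_z = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))

    # Standard cross-entropy on the normalised log-probabilities.
    log_probs = logits[np.arange(len(targets)), targets] - log_z
    ce = -log_probs.mean()

    # Penalise deviation of log Z from a constant, so that at test
    # time the normaliser can be treated as fixed and skipped, making
    # output-layer evaluation cost independent of the vocabulary size.
    penalty = 0.5 * gamma * np.mean((log_z - log_z_bar) ** 2)
    return ce + penalty
```

The design intent is that after training, the raw score of a single word can be used directly as its (approximately normalised) log-probability, avoiding the full-vocabulary summation that dominates F-RNNLM evaluation cost.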
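The noise contrastive estimation criterion avoids computing the normaliser during training as well, by recasting density estimation as classifying data words against samples from a noise distribution. The following is a minimal sketch under the usual NCE formulation with the normaliser held constant; the argument names and batch layout are assumptions for illustration, not the CUED-RNNLM interface.

```python
import numpy as np

def nce_loss(data_scores, data_log_q, noise_scores, noise_log_q, k):
    """Noise contrastive estimation loss for one batch (illustrative
    sketch; the softmax normaliser is treated as a constant of 1, i.e.
    log Z = 0, so it never needs to be computed).

    data_scores : (batch,)   unnormalised log scores s(w|h) of true words
    data_log_q  : (batch,)   log q(w) of true words under the noise dist.
    noise_scores: (batch, k) unnormalised log scores of k noise samples
    noise_log_q : (batch, k) log q of the noise samples
    """
    # Logit of the posterior that a word came from the data rather than
    # the noise distribution: sigma(s(w|h) - log(k * q(w))).
    d_data = data_scores - (np.log(k) + data_log_q)
    d_noise = noise_scores - (np.log(k) + noise_log_q)

    # -log sigma(x) = logaddexp(0, -x); -log(1 - sigma(x)) = logaddexp(0, x).
    loss = np.logaddexp(0.0, -d_data) + np.logaddexp(0.0, d_noise).sum(axis=1)
    return loss.mean()
```

Because only the true word and k noise samples are scored per training token, the per-token training cost at the output layer is independent of the vocabulary size, which is the source of the training speed-ups reported in the abstract.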
