Limited-Memory BFGS Optimization of Recurrent Neural Network Language Models for Speech Recognition

Recurrent neural network language models (RNNLMs) have become an increasingly popular choice for state-of-the-art speech recognition systems. RNNLMs are normally trained by minimizing the cross-entropy (CE) criterion with the stochastic gradient descent (SGD) algorithm. SGD uses only first-order derivatives; no higher-order gradient information is exploited to model the correlation between parameters, so the curvature of the error cost function is not fully captured. This can lead to slow convergence in model training. In this paper, a limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) based second-order optimization technique is proposed for RNNLMs. This method efficiently approximates the matrix-vector product between the inverse Hessian and the gradient vector via a recursion over past gradients, with a compact memory requirement. Consistent perplexity and error rate reductions over the SGD method are obtained on two speech recognition tasks: Switchboard English and Babel Cantonese. Faster convergence and a speed-up in RNNLM training time are also obtained.

Index Terms: recurrent neural network, language model, second order optimization, limited-memory BFGS, speech recognition
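The abstract's "recursion over past gradients" corresponds to the standard L-BFGS two-loop recursion, which approximates the product of the inverse Hessian with the current gradient using only the m most recent curvature pairs. Below is a minimal NumPy sketch of that recursion for reference; the function name `lbfgs_direction`, the history length, and the toy quadratic demo are illustrative assumptions, not details taken from this paper.

```python
import numpy as np

def lbfgs_direction(grad, s_hist, y_hist):
    """Approximate H^{-1} @ grad with the standard L-BFGS two-loop
    recursion over the stored curvature pairs
    (s_i = x_{i+1} - x_i, y_i = g_{i+1} - g_i).
    Illustrative sketch, not the paper's implementation; assumes at
    least one curvature pair with y_i^T s_i > 0.
    """
    rhos = [1.0 / float(y @ s) for s, y in zip(s_hist, y_hist)]
    q = grad.copy()
    alphas = []
    # First loop: walk backwards through the history (newest first).
    for s, y, rho in zip(reversed(s_hist), reversed(y_hist), reversed(rhos)):
        alpha = rho * float(s @ q)
        q = q - alpha * y
        alphas.append(alpha)
    # Scale by the common initial-Hessian guess gamma = s^T y / y^T y.
    s_last, y_last = s_hist[-1], y_hist[-1]
    r = (float(s_last @ y_last) / float(y_last @ y_last)) * q
    # Second loop: walk forwards (oldest first), re-applying corrections.
    for s, y, rho, alpha in zip(s_hist, y_hist, rhos, reversed(alphas)):
        beta = rho * float(y @ r)
        r = r + (alpha - beta) * s
    return r  # the descent direction is -r

if __name__ == "__main__":
    # Toy check on a quadratic f(x) = 0.5 x^T A x, whose Hessian is A:
    # the L-BFGS direction should align with the Newton direction A^{-1} g.
    rng = np.random.default_rng(0)
    A = np.diag([1.0, 10.0, 100.0])  # ill-conditioned quadratic
    x = rng.standard_normal(3)
    grad = lambda x: A @ x
    s_hist, y_hist = [], []
    for _ in range(10):  # collect curvature pairs from plain gradient steps
        g = grad(x)
        x_new = x - 0.005 * g
        s_hist.append(x_new - x)
        y_hist.append(grad(x_new) - g)
        x = x_new
    d = lbfgs_direction(grad(x), s_hist[-5:], y_hist[-5:])
    newton = np.linalg.solve(A, grad(x))
    cos = float(d @ newton) / (np.linalg.norm(d) * np.linalg.norm(newton))
    print(f"cosine(L-BFGS dir, Newton dir) = {cos:.3f}")
```

Because the recursion touches only the m stored (s, y) pairs, its memory cost is O(mn) for n parameters rather than the O(n^2) of a full inverse-Hessian approximation, which is what makes the method practical for RNNLM-sized models.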
