Exploiting Future Word Contexts in Neural Network Language Models for Speech Recognition

Language modeling is a crucial component in a wide range of applications, including speech recognition. Language models (LMs) are usually constructed by splitting a sentence into words and computing the probability of each word given its word history. This sentence-probability calculation, based on conditional probability distributions, implicitly assumes that the approximations made in the LM, such as the word-history representation and the use of finite training data, have little impact. This motivates examining models that make use of additional information from the sentence. In this paper, future word information, in addition to the history, is used to predict the probability of the current word. For recurrent neural network LMs (RNNLMs), this information can be encapsulated in a bi-directional model. However, used directly, this form of model is computationally expensive to train on large quantities of data and can be problematic to apply to word lattices. This paper proposes a novel neural network language model structure, the succeeding-word RNNLM (su-RNNLM), to address these issues. Instead of using a recurrent unit to capture the complete future word context, a feedforward unit models a fixed, finite number of succeeding words. This makes the model more efficient to train than bi-directional models and allows it to be applied to lattice rescoring. The generated lattices can be used for downstream applications, such as confusion network decoding and keyword search. Experimental results on speech recognition and keyword spotting tasks illustrate the empirical usefulness of future word information and the flexibility of the proposed model in representing this information.
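
To make the described structure concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of a succeeding-word RNNLM in PyTorch: a recurrent unit summarizes the word history, while a feedforward projection over a fixed window of k succeeding words supplies the limited future context. All module names, dimensions (embed_dim, hidden_dim), and the window size k_future are assumptions made for illustration only.

```python
# Minimal sketch of a su-RNNLM-style model, assuming a PyTorch setup.
# The recurrent history unit and feedforward future unit mirror the structure
# described in the abstract; names and sizes are illustrative, not the paper's.
import torch
import torch.nn as nn

class SuRNNLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, k_future=3):
        super().__init__()
        self.k = k_future
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Recurrent unit over the word history (an LSTM here; a GRU would also fit).
        self.history_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Feedforward unit over the k succeeding words (fixed, finite future context).
        self.future_ff = nn.Sequential(
            nn.Linear(self.k * embed_dim, hidden_dim),
            nn.Tanh(),
        )
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, history_ids, future_ids):
        # history_ids: (batch, t)  history words preceding the predicted word
        # future_ids:  (batch, k)  the k succeeding words, padded where unavailable
        h_emb = self.embed(history_ids)
        _, (h_last, _) = self.history_rnn(h_emb)       # (1, batch, hidden_dim)
        f_emb = self.embed(future_ids).flatten(1)      # (batch, k * embed_dim)
        f_vec = self.future_ff(f_emb)                  # (batch, hidden_dim)
        combined = torch.cat([h_last.squeeze(0), f_vec], dim=-1)
        return self.out(combined)                      # logits for the current word
```

In this reading of the abstract, the future window would be padded at sentence boundaries, and it is the bounded size of that window, in contrast to the unbounded future context of a bi-directional model, that makes training more efficient and lattice rescoring practical.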
