Improvements to N-gram Language Model Using Text Generated from Neural Language Model

Although neural language models have become widespread, n-gram language models are still used in many speech recognition tasks. This paper proposes four methods for improving n-gram language models using text generated from a recurrent neural network language model (RNNLM). First, we use multiple RNNLMs from different domains instead of a single RNNLM; the final n-gram language model is obtained by interpolating the n-gram models generated from each domain. Second, we use subwords instead of words for the RNNLM to reduce the out-of-vocabulary rate. Third, we generate text templates with an RNNLM for template-based data augmentation of named entities. Fourth, we use both a forward RNNLM and a backward RNNLM to generate text. We found that these four methods improved speech recognition performance by up to 4% relative across various tasks.
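The first method interpolates n-gram models built from each domain's generated text, i.e. P(w | h) = Σᵢ λᵢ Pᵢ(w | h) with mixture weights λᵢ summing to one. The paper would do this with a standard LM toolkit; the following is only a toy Python sketch of that interpolation idea, using unsmoothed bigram models and hand-picked weights (all corpora and weights here are invented for illustration):

```python
from collections import defaultdict

def bigram_probs(corpus):
    """Estimate bigram probabilities P(w | h) from tokenized sentences (MLE, no smoothing)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in corpus:
        for h, w in zip(sent, sent[1:]):
            counts[h][w] += 1
    return {h: {w: c / sum(ws.values()) for w, c in ws.items()}
            for h, ws in counts.items()}

def interpolate(models, weights, h, w):
    """Linear interpolation across domain models: sum_i lambda_i * P_i(w | h)."""
    return sum(lam * m.get(h, {}).get(w, 0.0) for lam, m in zip(weights, models))

# Two "domains": e.g. in-domain text and text generated from an RNNLM.
m1 = bigram_probs([["call", "my", "mother"], ["call", "my", "office"]])
m2 = bigram_probs([["call", "the", "office"]])
p = interpolate([m1, m2], [0.7, 0.3], "call", "my")  # 0.7 * 1.0 + 0.3 * 0.0 = 0.7
```

In practice the mixture weights would be tuned (e.g. to minimize perplexity on held-out in-domain data) rather than fixed by hand, and smoothed n-gram estimates would replace the raw MLE counts.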
