Neural candidate-aware language models for speech recognition

Abstract

This paper presents novel neural-network-based language models that can correct automatic speech recognition (ASR) errors by using speech recognizer outputs as context. Our proposed models, called neural candidate-aware language models (NCALMs), estimate the generative probability of a target sentence while considering ASR outputs, including hypotheses and their posterior probabilities. Recently, neural network language models have achieved great success in the ASR field because of their ability to learn long-range contexts and to model word representations in continuous space. However, they estimate a sentence probability without considering other candidates and their posterior probabilities, even though the competing hypotheses are available and carry important information for improving speech recognition accuracy. To overcome this limitation, we utilize ASR outputs in both the training phase and the inference phase. Our proposed models are conditional generative models consisting of a Transformer encoder and a Transformer decoder: the encoder embeds the candidates as context vectors, and the decoder estimates a sentence probability given those context vectors. We evaluate the proposed models on a Japanese lecture transcription task and an English conversational speech recognition task. Experimental results show that an NCALM achieves better ASR performance than a baseline system that includes a deep neural network-hidden Markov model hybrid system. We further improve ASR performance by using an NCALM and a Transformer language model simultaneously.
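As a rough illustration of the encoder-decoder formulation described above, the following PyTorch sketch encodes concatenated N-best hypotheses and decodes the target sentence autoregressively, conditioned on the resulting context vectors. All module names, layer sizes, and the way posterior probabilities are injected (here, as an additive embedding per hypothesis token) are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class NCALM(nn.Module):
    """Sketch of a candidate-aware LM: encoder reads ASR hypotheses,
    decoder scores a target sentence given those context vectors."""

    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=3, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Hypothetical design choice: inject each token's candidate
        # posterior probability as an additive embedding.
        self.post_proj = nn.Linear(1, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.out_proj = nn.Linear(d_model, vocab_size)

    def _embed(self, tokens):
        pos = torch.arange(tokens.size(1), device=tokens.device)
        return self.token_emb(tokens) + self.pos_emb(pos)

    def forward(self, hyp_tokens, hyp_posteriors, tgt_tokens):
        # hyp_tokens:     (batch, src_len) concatenated N-best hypotheses
        # hyp_posteriors: (batch, src_len, 1) posterior of each token's hypothesis
        # tgt_tokens:     (batch, tgt_len) target sentence, shifted right
        src = self._embed(hyp_tokens) + self.post_proj(hyp_posteriors)
        tgt = self._embed(tgt_tokens)
        # Causal mask so the decoder predicts each word left to right.
        L = tgt_tokens.size(1)
        causal = torch.triu(
            torch.full((L, L), float("-inf"), device=tgt_tokens.device),
            diagonal=1,
        )
        h = self.transformer(src, tgt, tgt_mask=causal)
        return self.out_proj(h)  # next-token logits over the vocabulary


# Toy usage: 2 utterances, 40 candidate tokens, 12-token target.
model = NCALM(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 40)),
               torch.rand(2, 40, 1),
               torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 1000])
```

In this reading, summing the per-token logits of a hypothesis under the softmax of these logits would give its candidate-aware sentence score for N-best rescoring; the actual scoring and training objective follow the paper, not this sketch.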
