JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition

We propose JEIT, a joint end-to-end (E2E) model and internal language model (ILM) training method that injects large-scale unpaired text into the ILM during E2E training to improve rare-word speech recognition. With JEIT, the E2E model computes an E2E loss on audio-transcript pairs while its ILM computes a cross-entropy loss on unpaired text, and the E2E model is trained to minimize a weighted sum of the two losses. During JEIT, the ILM absorbs knowledge from the unpaired text while the E2E training serves as regularization. Unlike ILM adaptation methods, JEIT requires neither a separate adaptation step nor Kullback-Leibler divergence regularization of the ILM. We also show that the modular hybrid autoregressive transducer (MHAT) performs better than the hybrid autoregressive transducer (HAT) in the JEIT framework, and is much more robust than HAT during ILM adaptation. To push the limit of unpaired-text injection, we further propose combined JEIT and JOIST training (CJJT), which benefits from modality matching, encoder text injection, and ILM training. Both JEIT and CJJT also foster more effective LM fusion. With 100B unpaired sentences, JEIT/CJJT improves rare-word recognition accuracy by up to 16.4% over a model trained without unpaired text.
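To make the joint objective concrete, the following is a minimal sketch of the weighted-sum loss described above. The notation is illustrative rather than taken from the paper: (x, y) denotes an audio-transcript pair, y' an unpaired sentence, and λ the ILM loss weight; the paper may use different symbols.

```latex
% Sketch of the JEIT objective: E2E loss on paired data plus a
% weighted ILM cross-entropy loss on unpaired text (notation assumed).
\mathcal{L}_{\text{JEIT}}
  = \mathcal{L}_{\text{E2E}}(x, y)
  + \lambda \, \mathcal{L}_{\text{ILM}}(y'),
\qquad
\mathcal{L}_{\text{ILM}}(y')
  = -\sum_{u=1}^{|y'|} \log P_{\text{ILM}}\!\left(y'_u \mid y'_{<u}\right)
```

Under this reading, λ trades off how much the ILM is pushed toward the unpaired-text distribution against how strongly the paired E2E loss regularizes it.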
