An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling

On-device end-to-end (E2E) models have shown improvements over conventional models on Search test sets in both quality, as measured by Word Error Rate (WER) [1], and latency [2], measured as the time from when the user stops speaking to when the result is finalized. However, the E2E model is trained on only a small fraction of audio-text pairs compared to the 100 billion text utterances on which a conventional language model (LM) is trained; as a result, E2E models perform poorly on rare words and phrases. In this paper, building on the two-pass streaming Cascaded Encoder E2E model [3], we explore a Hybrid Autoregressive Transducer (HAT) [4] factorization to better integrate an on-device neural LM trained on text-only data. In addition, to reduce decoder latency, we replace the typical LSTM decoder in the Cascaded Encoder model with a non-recurrent embedding decoder. Overall, we present a streaming on-device model that incorporates an external neural LM and outperforms the conventional model in both Search and rare-word quality, as well as latency, and is 318X smaller.
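To make the LM-integration idea concrete, the sketch below shows how a HAT-style factorization permits internal-LM subtraction during shallow fusion: because HAT separates the blank posterior from the label posterior, an estimate of the decoder's internal LM score can be subtracted before the external text-only LM score is added. This is a minimal Python/NumPy sketch; the function name, argument names, and weights are illustrative assumptions, not the paper's tuned recipe.

```python
import numpy as np

def hat_fused_scores(label_logp, ext_lm_logp, ilm_logp,
                     lm_weight=0.5, ilm_weight=0.3):
    """Per-step beam-search scores for HAT-style external-LM fusion.

    label_logp:  log P(label | audio, history) from the HAT joint network
    ext_lm_logp: log P(label | history) from the external text-only neural LM
    ilm_logp:    internal-LM estimate, e.g. the joint network evaluated
                 with the acoustic encoder contribution zeroed out
    All arrays are log-probabilities over the wordpiece vocabulary;
    the weights here are illustrative, not the paper's tuned values.
    """
    return label_logp + lm_weight * ext_lm_logp - ilm_weight * ilm_logp

# Toy usage: scores for a 3-wordpiece vocabulary at one expansion step.
scores = hat_fused_scores(np.log([0.6, 0.3, 0.1]),
                          np.log([0.2, 0.5, 0.3]),
                          np.log([0.5, 0.4, 0.1]))
print(scores.argmax())  # the highest fused score wins the beam expansion
```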
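The non-recurrent embedding decoder can be sketched in the same spirit: instead of an LSTM carrying unbounded history, the prediction network conditions on only the last few non-blank labels through embedding lookups, so each decoder step is a constant-time table lookup with no recurrent state. Below is a minimal PyTorch sketch under that assumption; the class name, layer sizes, and two-token context are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class EmbeddingDecoder(nn.Module):
    """Stateless prediction network: conditions on the last `context`
    non-blank labels via per-position embedding tables instead of an
    LSTM, so decoding carries no recurrent state."""

    def __init__(self, vocab_size=4096, embed_dim=320, context=2):
        super().__init__()
        # One embedding table per context position.
        self.embeddings = nn.ModuleList(
            nn.Embedding(vocab_size, embed_dim) for _ in range(context))
        self.proj = nn.Linear(context * embed_dim, embed_dim)

    def forward(self, last_labels):
        # last_labels: [batch, context] most recent non-blank label ids.
        parts = [emb(last_labels[:, i])
                 for i, emb in enumerate(self.embeddings)]
        return self.proj(torch.cat(parts, dim=-1))

# Toy usage: a batch of two hypotheses, each with its last two labels.
dec = EmbeddingDecoder()
out = dec(torch.tensor([[17, 42], [7, 99]]))
print(out.shape)  # torch.Size([2, 320])
```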

[1] C. Allauzen et al., "Bayesian Language Model Interpolation for Mobile Speech Input," in Proc. INTERSPEECH, 2011.

[2] Y. Gong et al., "Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM," in Proc. IEEE SLT, 2012.

[3] A. Graves, "Sequence Transduction with Recurrent Neural Networks," arXiv preprint, 2012.

[4] M. Schuster et al., "Japanese and Korean voice search," in Proc. IEEE ICASSP, 2012.

[5] H. Liao et al., "Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription," in Proc. IEEE ASRU, 2013.

[6] B. Roark et al., "Bringing contextual information to Google speech recognition," in Proc. INTERSPEECH, 2015.

[7] W. Chan et al., "Listen, Attend and Spell," arXiv preprint, 2015.

[8] A. Gutkin et al., "Recent Advances in Google Real-Time HMM-Driven Unit Selection Synthesizer," in Proc. INTERSPEECH, 2016.

[9] T. N. Sainath et al., "Lower Frame Rate Neural Network Acoustic Models," in Proc. INTERSPEECH, 2016.

[10] A. Vaswani et al., "Attention Is All You Need," in Proc. NIPS, 2017.

[11] F. Biadsy et al., "Effectively Building Tera Scale MaxEnt Language Models Incorporating Non-Linguistic Signals," in Proc. INTERSPEECH, 2017.

[12] H. Inan et al., "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling," in Proc. ICLR, 2017.

[13] T. N. Sainath et al., "Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home," in Proc. INTERSPEECH, 2017.

[14] J. Chorowski and N. Jaitly, "Towards Better Decoding and Language Model Integration in Sequence to Sequence Models," in Proc. INTERSPEECH, 2017.

[15] S. Watanabe et al., "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in Proc. IEEE ICASSP, 2017.

[16] T. Bagby et al., "End-to-End Training of Acoustic Models for Large Vocabulary Continuous Speech Recognition with TensorFlow," in Proc. INTERSPEECH, 2017.

[17] T. N. Sainath et al., "State-of-the-Art Speech Recognition with Sequence-to-Sequence Models," in Proc. IEEE ICASSP, 2018.

[18] T. N. Sainath et al., "Two-Pass End-to-End Speech Recognition," in Proc. INTERSPEECH, 2019.

[19] T. N. Sainath et al., "Recognizing Long-Form Speech Using Streaming End-to-End Models," in Proc. IEEE ASRU, 2019.

[20] N. S. Keskar et al., "CTRL: A Conditional Transformer Language Model for Controllable Generation," arXiv preprint, 2019.

[21] D. S. Park et al., "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition," in Proc. INTERSPEECH, 2019.

[22] Z. Dai et al., "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context," in Proc. ACL, 2019.

[23] T. N. Sainath et al., "Shallow-Fusion End-to-End Contextual Biasing," in Proc. INTERSPEECH, 2019.

[24] J. Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," in Proc. NAACL, 2019.

[25] A. Gulati et al., "Conformer: Convolution-augmented Transformer for Speech Recognition," in Proc. INTERSPEECH, 2020.

[26] T. N. Sainath et al., "Towards Fast and Accurate Streaming End-to-End ASR," in Proc. IEEE ICASSP, 2020.

[27] T. N. Sainath et al., "A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency," in Proc. IEEE ICASSP, 2020.

[28] T. N. Sainath et al., "Low Latency Speech Recognition Using End-to-End Prefetching," in Proc. INTERSPEECH, 2020.

[29] J. Li et al., "On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition," in Proc. INTERSPEECH, 2020.

[30] T. N. Sainath et al., "Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus," in Proc. INTERSPEECH, 2020.

[31] E. Variani et al., "Hybrid Autoregressive Transducer (HAT)," in Proc. IEEE ICASSP, 2020.

[32] T. N. Sainath et al., "A Better and Faster End-to-End Model for Streaming ASR," in Proc. IEEE ICASSP, 2021.

[33] T. N. Sainath et al., "Tied & Reduced RNN-T Decoder," in Proc. INTERSPEECH, 2021.

[34] T. N. Sainath et al., "Cascaded Encoders for Unifying Streaming and Non-Streaming ASR," in Proc. IEEE ICASSP, 2021.

[35] T. N. Sainath et al., "FastEmit: Low-Latency Streaming ASR with Sequence-Level Emission Regularization," in Proc. IEEE ICASSP, 2021.