A Better and Faster End-to-End Model for Streaming ASR

End-to-end (E2E) models have been shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the E2E model still tends to delay its predictions, and thus has much higher partial latency than a conventional ASR model. To address this issue, we encourage the E2E model to emit words early through an algorithm called FastEmit [3]. Improving latency in this way naturally degrades quality. To recover it, we first explore replacing the LSTM layers in the encoder of our E2E model with Conformer layers [4], which have shown good improvements for ASR. Second, we explore running a 2nd-pass beam search to improve quality. To ensure the 2nd pass completes quickly, we add non-causal Conformer layers that feed into the same 1st-pass RNN-T decoder, an approach we call Cascaded Encoders. Overall, we find that the Conformer RNN-T with Cascaded Encoders offers a better quality-latency tradeoff for streaming ASR.
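The Cascaded Encoders data flow described above can be sketched in a few lines: a causal (streaming) encoder drives the RNN-T decoder for 1st-pass partial results, and additional non-causal layers, which may look at limited right context, refine that encoder output for a 2nd pass through the *same* decoder. This is a hypothetical toy sketch of the wiring only; all functions, the recurrent update, and the `right_context` parameter are illustrative stand-ins, not the paper's actual model.

```python
def causal_encode(frames):
    # Streaming encoder: each output depends only on past/current frames.
    out, state = [], 0.0
    for x in frames:
        state = 0.5 * state + x  # toy recurrent update (stand-in for LSTM/Conformer)
        out.append(state)
    return out

def noncausal_encode(causal_out, right_context=2):
    # Non-causal layers: each output may also look a few frames ahead,
    # so this pass can only run once those future frames are available.
    out = []
    for i in range(len(causal_out)):
        window = causal_out[i : i + 1 + right_context]
        out.append(sum(window) / len(window))
    return out

def rnnt_decode(enc_out):
    # Stand-in for the shared RNN-T decoder (beam search in the paper):
    # the same decoder consumes either encoder's output.
    return ["tok%d" % i for i, v in enumerate(enc_out) if v > 0.4]

frames = [1.0, 0.0, 1.0, 1.0, 0.0]
first_pass = rnnt_decode(causal_encode(frames))                     # streaming partials
second_pass = rnnt_decode(noncausal_encode(causal_encode(frames)))  # refined final result
```

The key design point the sketch illustrates is that only the encoder is cascaded; because the decoder is shared, the 2nd pass reuses the 1st-pass decoder rather than requiring a separate rescoring model.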

[1] Jonathan Le Roux et al., "Triggered Attention for End-to-End Speech Recognition," ICASSP, 2019.

[2] Tara N. Sainath et al., "Emitting Word Timings with End-to-End Models," INTERSPEECH, 2020.

[3] Yiming Yang et al., "Transformer-XL: Attentive Language Models beyond a Fixed-Length Context," ACL, 2019.

[4] Tara N. Sainath et al., "Streaming End-to-End Speech Recognition for Mobile Devices," ICASSP, 2019.

[5] Tara N. Sainath et al., "Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling," arXiv, 2019.

[6] Tara N. Sainath et al., "Towards Fast and Accurate Streaming End-to-End ASR," ICASSP, 2020.

[7] Geoffrey Zweig et al., "Transformer-Based Acoustic Modeling for Hybrid Speech Recognition," ICASSP, 2020.

[8] Mike Schuster et al., "Japanese and Korean Voice Search," ICASSP, 2012.

[9] Tara N. Sainath et al., "Lower Frame Rate Neural Network Acoustic Models," INTERSPEECH, 2016.

[10] Quoc V. Le et al., "Listen, Attend and Spell," arXiv, 2015.

[11] Fabrizio Silvestri et al., "Prefetching Query Results and Its Impact on Search Engines," SIGIR, 2012.

[12] Tara N. Sainath et al., "A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency," ICASSP, 2020.

[13] Shinji Watanabe et al., "Joint CTC-Attention Based End-to-End Speech Recognition Using Multi-Task Learning," ICASSP, 2017.

[14] Chengyi Wang et al., "Reducing the Latency of End-to-End Streaming Speech Recognition Models with a Scout Network," 2020.

[15] Martín Abadi et al., "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems," arXiv, 2016.

[16] Tara N. Sainath et al., "Shallow-Fusion End-to-End Contextual Biasing," INTERSPEECH, 2019.

[17] Tara N. Sainath et al., "Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home," INTERSPEECH, 2017.

[18] Tara N. Sainath et al., "State-of-the-Art Speech Recognition with Sequence-to-Sequence Models," ICASSP, 2018.

[19] Tara N. Sainath et al., "Recognizing Long-Form Speech Using Streaming End-to-End Models," ASRU, 2019.

[20] Fadi Biadsy et al., "Effectively Building Tera Scale MaxEnt Language Models Incorporating Non-Linguistic Signals," INTERSPEECH, 2017.

[21] Gang Liu et al., "An Online Attention-Based Model for Speech Recognition," INTERSPEECH, 2018.

[22] Samy Bengio et al., "An Online Sequence-to-Sequence Model Using Partial Conditioning," NIPS, 2015.

[23] Chengyi Wang et al., "Low Latency End-to-End Streaming Speech Recognition with a Scout Network," INTERSPEECH, 2020.

[24] Yifan Gong et al., "Improving Wideband Speech Recognition Using Mixed-Bandwidth Training Data in CD-DNN-HMM," SLT, 2012.

[25] Yu Zhang et al., "Conformer: Convolution-augmented Transformer for Speech Recognition," INTERSPEECH, 2020.

[26] Yashesh Gaur et al., "Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR," ICASSP, 2020.

[27] Jimmy Ba et al., "Adam: A Method for Stochastic Optimization," ICLR, 2015.

[28] Alex Graves, "Sequence Transduction with Recurrent Neural Networks," arXiv, 2012.

[29] Luiz André Barroso et al., "Web Search for a Planet: The Google Cluster Architecture," IEEE Micro, 2003.

[30] Yashesh Gaur et al., "On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition," INTERSPEECH, 2020.

[31] Colin Raffel et al., "Online and Linear-Time Attention by Enforcing Monotonic Alignments," ICML, 2017.

[32] Hank Liao et al., "Large Scale Deep Neural Network Acoustic Modeling with Semi-Supervised Training Data for YouTube Video Transcription," ASRU, 2013.

[33] Tara N. Sainath et al., "Low Latency Speech Recognition Using End-to-End Prefetching," INTERSPEECH, 2020.

[34] Jürgen Schmidhuber et al., "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks," ICML, 2006.

[35] Kaiming He et al., "Group Normalization," ECCV, 2018.

[36] Shinji Watanabe et al., "Streaming Transformer ASR with Blockwise Synchronous Inference," arXiv, 2020.

[37] Tara N. Sainath et al., "Two-Pass End-to-End Speech Recognition," INTERSPEECH, 2019.

[38] Tara N. Sainath et al., "FastEmit: Low-Latency Streaming ASR with Sequence-Level Emission Regularization," arXiv, 2020.

[39] Tara N. Sainath et al., "A Comparison of End-to-End Models for Long-Form Speech Recognition," ASRU, 2019.