E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

We explore unifying a neural segmenter with a two-pass cascaded encoder ASR model into a single model. A key challenge is allowing the segmenter (which runs in real time, synchronously with the decoder) to finalize the 2nd pass (which runs 900 ms behind real time) without introducing user-perceived latency or deletion errors during inference. We propose a design where the neural segmenter is integrated with the causal 1st pass decoder to emit an end-of-segment (EOS) signal in real time. The EOS signal is then used to finalize the non-causal 2nd pass. We experiment with different ways to finalize the 2nd pass and find that a novel dummy frame injection strategy yields both high-quality 2nd pass results and low finalization latency. On a real-world long-form captioning task (YouTube), we achieve a 2.4% relative WER improvement and a 140 ms reduction in EOS latency over a baseline VAD-based segmenter with the same cascaded encoder.
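The finalization problem described above can be illustrated with a toy simulation. This is a minimal sketch, not the paper's implementation: the 60 ms frame duration, the integer frame representation, the `DUMMY` value, and the names `LookaheadEncoder`, `finalize_with_dummy_frames`, and `run_segment` are all assumptions made for illustration. Only the 900 ms lag comes from the abstract. The sketch models a non-causal pass that can emit frame t only after 900 ms of later frames have arrived, and shows how injecting dummy frames at EOS flushes the pending frames at once instead of waiting for more real audio.

```python
from collections import deque
from typing import List, Optional, Tuple

FRAME_MS = 60            # assumed frame duration (not stated in the abstract)
RIGHT_CONTEXT_MS = 900   # from the abstract: the 2nd pass runs 900 ms behind
RIGHT_CONTEXT = RIGHT_CONTEXT_MS // FRAME_MS  # lookahead in frames
DUMMY = -1               # hypothetical placeholder for an injected dummy frame


class LookaheadEncoder:
    """Toy non-causal pass: frame t is emitted only after RIGHT_CONTEXT later frames arrive."""

    def __init__(self, right_context: int) -> None:
        self.right_context = right_context
        self.buffer: deque = deque()

    def push(self, frame: int) -> Optional[int]:
        # Buffer the new frame; emit the oldest one once it has full right context.
        self.buffer.append(frame)
        if len(self.buffer) > self.right_context:
            return self.buffer.popleft()
        return None


def finalize_with_dummy_frames(encoder: LookaheadEncoder) -> List[int]:
    """On EOS, inject dummy frames so every pending real frame is emitted immediately."""
    flushed = []
    for _ in range(encoder.right_context):
        out = encoder.push(DUMMY)
        if out is not None:
            flushed.append(out)
    encoder.buffer.clear()  # drop the dummy frames before the next segment starts
    return flushed


def run_segment(num_frames: int) -> Tuple[List[int], List[int]]:
    """Stream num_frames real frames, then treat the last frame as the EOS point."""
    encoder = LookaheadEncoder(RIGHT_CONTEXT)
    streamed = []
    for t in range(num_frames):
        out = encoder.push(t)
        if out is not None:
            streamed.append(out)
    flushed = finalize_with_dummy_frames(encoder)
    return streamed, flushed


if __name__ == "__main__":
    streamed, flushed = run_segment(20)
    print("emitted while streaming:", streamed)
    print("emitted at finalization:", flushed)
```

In this toy setting, every real frame of the segment is emitted either during streaming or at finalization, so no frames are dropped at the segment boundary and no additional real audio is needed after EOS. What the dummy frames contain in the actual model, and how they affect 2nd pass quality, is not specified in the abstract.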
