论文信息 - E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model. A key challenge is allowing the segmenter (which runs in real-time, synchronously with the decoder) to finalize the 2nd pass (which runs 900 ms behind real-time) without introducing user-perceived latency or deletion errors during inference. We propose a design where the neural segmenter is integrated with the causal 1st pass decoder to emit a end-of-segment (EOS) signal in real-time. The EOS signal is then used to finalize the non-causal 2nd pass. We experiment with different ways to finalize the 2nd pass, and find that a novel dummy frame injection strategy allows for simultaneous high quality 2nd pass results and low finalization latency. On a real-world long-form captioning task (YouTube), we achieve 2.4% relative WER and 140 ms EOS latency gains over a baseline VAD-based segmenter with the same cascaded encoder.

[1] Tara N. Sainath,et al. Turn-Taking Prediction for Natural Conversational Speech , 2022, INTERSPEECH.

[2] Tara N. Sainath,et al. Improving The Latency And Quality Of Cascaded Encoders , 2022, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3] Tara N. Sainath,et al. E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR , 2022, INTERSPEECH.

[4] Tara N. Sainath,et al. A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes , 2022, INTERSPEECH.

[5] M. Seltzer,et al. Streaming parallel transducer beam search with fast-slow cascaded encoders , 2022, INTERSPEECH.

[6] Rohit Prabhavalkar,et al. Dissecting User-Perceived Latency of On-Device E2E Speech Recognition , 2021, Interspeech.

[7] Tara N. Sainath,et al. Less is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8] Tara N. Sainath,et al. A Better and Faster end-to-end Model for Streaming ASR , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9] Tara N. Sainath,et al. Cascaded Encoders for Unifying Streaming and Non-Streaming ASR , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10] Tara N. Sainath,et al. RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions , 2020, 2021 IEEE Spoken Language Technology Workshop (SLT).

[11] Lei Xie,et al. Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition , 2020, ArXiv.

[12] Tara N. Sainath,et al. Low Latency Speech Recognition Using End-to-End Prefetching , 2020, INTERSPEECH.

[13] K. Takeda,et al. End-to-End Automatic Speech Recognition Integrated with CTC-Based Voice Activity Detection , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14] Hagen Soltau,et al. Monotonic Recurrent Neural Network Transducer and Decoding Strategies , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[15] Tara N. Sainath,et al. A Comparison of End-to-End Models for Long-Form Speech Recognition , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[16] Tara N. Sainath,et al. Recognizing Long-Form Speech Using Streaming End-to-End Models , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[17] Tara N. Sainath,et al. Two-Pass End-to-End Speech Recognition , 2019, INTERSPEECH.

[18] Arun Narayanan,et al. Toward Domain-Invariant Speech Recognition via Large Scale Training , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[19] Hagen Soltau,et al. Reducing the computational complexity for whole word models , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[20] Tara N. Sainath,et al. Feature Learning with Raw-Waveform CLDNNs for Voice Activity Detection , 2016, INTERSPEECH.

[21] Juan Manuel Górriz,et al. Voice Activity Detection. Fundamentals and Speech Recognition System Robustness , 2007 .