Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition

In this paper, we present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the encoder are modified. We propose a dynamic chunk-based attention strategy to allow arbitrary right context length. At inference time, the CTC decoder generates n-best hypotheses in a streaming way, and the inference latency can be controlled simply by changing the chunk size. The CTC hypotheses are then rescored by the attention decoder to obtain the final result. This efficient rescoring process introduces very little sentence-level latency. Our experiments on the open 170-hour AISHELL-1 dataset show that the proposed method unifies streaming and non-streaming models simply and efficiently. On the AISHELL-1 test set, our unified model achieves a 5.60% relative character error rate (CER) reduction in non-streaming ASR compared with a standard non-streaming transformer. The same model achieves 5.42% CER with 640 ms latency in a streaming ASR system.
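To make the chunk-based attention concrete, here is a minimal PyTorch sketch of the self-attention mask it implies: each frame sees the full left context but nothing beyond the end of its own chunk, and the chunk size is sampled during training. The function name, mask convention, and sampling range are illustrative assumptions, not the paper's code.

```python
import torch

def chunk_attention_mask(num_frames: int, chunk_size: int) -> torch.Tensor:
    """Boolean self-attention mask for chunk-based attention.

    Frame i may attend to frame j iff j lies in the same chunk as i or
    in any earlier chunk: full left context, right context capped at
    the end of the current chunk.
    """
    # Exclusive end index of the chunk each query frame belongs to.
    chunk_ends = (torch.arange(num_frames) // chunk_size + 1) * chunk_size
    cols = torch.arange(num_frames).unsqueeze(0)   # (1, T) key indices
    return cols < chunk_ends.unsqueeze(1)          # (T, T) mask

# During training the chunk size is sampled per batch, so one set of
# weights covers small chunks (streaming) up to the whole utterance
# (non-streaming). The uniform range below is an illustrative guess.
T = 100
chunk_size = int(torch.randint(1, T + 1, (1,)))
mask = chunk_attention_mask(T, chunk_size)
```

At inference, fixing a small chunk size yields a streaming encoder with latency roughly proportional to the chunk length, while a chunk spanning the whole utterance recovers non-streaming behavior, which is how a single model serves both modes.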

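The two-pass decoding described above can likewise be sketched in a few lines: the streaming CTC first pass supplies n-best hypotheses with scores, and the attention decoder rescores each one with a single teacher-forced forward pass over the shared encoder output. The att_score_fn interface and the exact score combination are hypothetical; the paper does not fix an API here.

```python
from typing import Callable, List, Sequence

def attention_rescore(
    nbest: List[Sequence[int]],          # first-pass CTC n-best token sequences
    ctc_scores: List[float],             # their first-pass log scores
    att_score_fn: Callable[[Sequence[int]], float],  # attention-decoder log-likelihood
    ctc_weight: float = 0.5,             # assumed interpolation weight
) -> Sequence[int]:
    """Return the hypothesis with the best combined two-pass score."""
    best_hyp, best_score = nbest[0], float("-inf")
    for hyp, ctc_score in zip(nbest, ctc_scores):
        # One teacher-forced forward pass of the attention decoder per
        # hypothesis; no autoregressive beam search in the second pass,
        # hence the small sentence-level latency noted in the abstract.
        score = att_score_fn(hyp) + ctc_weight * ctc_score
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp
```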