Universal ASR: Unifying Streaming and Non-Streaming ASR Using a Single Encoder-Decoder Model

Recently, online end-to-end ASR has gained increasing attention. However, the recognition accuracy of online systems still lags far behind that of offline systems. For a specific scenario, one can trade performance against latency and train multiple systems with different delays to match the requirements of various applications. In this work, instead of trading off performance against latency, we envisage a single system that can serve the needs of different scenarios. We propose a novel architecture, termed Universal ASR, that unifies streaming and non-streaming ASR models in one system. The embedded streaming ASR model can be configured with different delays to produce real-time recognition results, while the non-streaming model refreshes the final recognition result for better accuracy. We evaluate our approach on the public AISHELL-2 benchmark and on an industrial-scale 20,000-hour Mandarin speech recognition task. The experimental results show that Universal ASR provides an efficient mechanism for integrating streaming and non-streaming models so that speech can be recognized both quickly and accurately. On the AISHELL-2 task, Universal ASR comfortably outperforms other state-of-the-art systems.
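
As a rough illustration of how a single encoder could serve both modes, the sketch below is our own assumption rather than the paper's actual implementation: it builds a chunk-based self-attention mask in which each frame attends to the full left context plus the frames up to the end of its own chunk, so the chunk size plays the role of the configurable streaming delay, while passing `chunk_size=None` opens the full context for the non-streaming pass that refreshes the final result. The function name `build_attention_mask` and the NumPy realization are illustrative only.

```python
# Minimal sketch (illustrative, not the paper's implementation) of one encoder
# supporting both streaming and non-streaming recognition via its attention mask.
from typing import Optional
import numpy as np

def build_attention_mask(num_frames: int, chunk_size: Optional[int]) -> np.ndarray:
    """Return a boolean mask where mask[i, j] is True if frame i may attend to frame j.

    chunk_size=None  -> non-streaming: every frame sees the full utterance.
    chunk_size=k     -> streaming: frame i sees all past frames plus the frames
                        up to the end of its own k-frame chunk (the latency).
    """
    if chunk_size is None:
        return np.ones((num_frames, num_frames), dtype=bool)
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    for i in range(num_frames):
        right_limit = (i // chunk_size + 1) * chunk_size  # end of frame i's chunk
        mask[i, :min(right_limit, num_frames)] = True
    return mask

# Streaming pass with a 4-frame chunk (low latency), then a full-context pass
# over the same frames to refresh the final hypothesis.
streaming_mask = build_attention_mask(num_frames=16, chunk_size=4)
rescoring_mask = build_attention_mask(num_frames=16, chunk_size=None)
print(streaming_mask.sum(axis=1))  # visible context grows chunk by chunk
print(rescoring_mask.sum(axis=1))  # every frame sees all 16 frames
```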
