Shinji Watanabe | Xuankai Chang | Yuya Fujita | Tianzi Wang
[1] Victor O. K. Li, et al. Non-Autoregressive Neural Machine Translation, 2017, ICLR.
[2] Jindrich Libovický, et al. End-to-End Non-Autoregressive Neural Machine Translation with Connectionist Temporal Classification, 2018, EMNLP.
[3] Shinji Watanabe, et al. Insertion-Based Modeling for End-to-End Automatic Speech Recognition, 2020, INTERSPEECH.
[4] Qian Zhang, et al. Transformer Transducer: One Model Unifying Streaming and Non-streaming Speech Recognition, 2020, ArXiv.
[5] Quoc V. Le, et al. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition, 2019, INTERSPEECH.
[6] Daniel Povey, et al. The Kaldi Speech Recognition Toolkit, 2011.
[7] Hairong Liu, et al. Exploring neural transducers for end-to-end speech recognition, 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
[8] Katrin Kirchhoff, et al. Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment, 2020, NAACL.
[9] Jürgen Schmidhuber, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, 2006, ICML.
[10] Kevin Duh, et al. ORTHROS: Non-Autoregressive End-to-End Speech Translation with Dual-Decoder, 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[11] Omer Levy, et al. Mask-Predict: Parallel Decoding of Conditional Masked Language Models, 2019, EMNLP.
[12] Navdeep Jaitly, et al. Imputer: Sequence Modelling via Imputation and Dynamic Programming, 2020, ICML.
[13] Gil Keren, et al. Alignment Restricted Streaming Recurrent Neural Network Transducer, 2021, 2021 IEEE Spoken Language Technology Workshop (SLT).
[14] Tara N. Sainath, et al. Towards Fast and Accurate Streaming End-To-End ASR, 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[15] Tatsuya Kawahara, et al. Enhancing Monotonic Multihead Attention for Streaming ASR, 2020, INTERSPEECH.
[16] Shinji Watanabe, et al. Recent Developments on ESPnet Toolkit Boosted by Conformer, 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[17] Hao Zheng, et al. AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline, 2017, 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA).
[18] Colin Raffel, et al. Monotonic Chunkwise Attention, 2017, ICLR.
[19] Tetsunori Kobayashi, et al. Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict, 2020, INTERSPEECH.
[20] Wei Chu, et al. CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer for Speech Recognition, 2020, ArXiv.
[21] Tetsuji Ogawa, et al. Improved Mask-CTC for Non-Autoregressive End-to-End ASR, 2020, ArXiv.
[22] Shinji Watanabe, et al. Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition, 2019, ArXiv.
[23] Yiming Yang, et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context, 2019, ACL.
[24] Yu Zhang, et al. Conformer: Convolution-augmented Transformer for Speech Recognition, 2020, INTERSPEECH.
[25] Yoshua Bengio, et al. End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results, 2014, ArXiv.
[26] Jonathan Le Roux, et al. Streaming Automatic Speech Recognition with the Transformer Model, 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[27] Xiaofei Wang, et al. A Comparative Study on Transformer vs RNN in Speech Applications, 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
[28] Kjell Schubert, et al. RNN-T For Latency Controlled ASR With Improved Beam Search, 2019, ArXiv.
[29] Alex Graves, et al. Sequence Transduction with Recurrent Neural Networks, 2012, ArXiv.
[30] Shinji Watanabe, et al. Streaming Transformer ASR with Blockwise Synchronous Beam Search, 2020, 2021 IEEE Spoken Language Technology Workshop (SLT).
[31] Shuai Zhang, et al. Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition, 2020, INTERSPEECH.
[32] Shinji Watanabe, et al. ESPnet: End-to-End Speech Processing Toolkit, 2018, INTERSPEECH.
[33] Paul Deléglise, et al. Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks, 2014, LREC.
[34] Yonghong Yan, et al. Transformer-Based Online CTC/Attention End-To-End Speech Recognition Architecture, 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[35] Tara N. Sainath, et al. A Comparison of End-to-End Models for Long-Form Speech Recognition, 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).