Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models

Non-autoregressive (NAR) modeling has gained more and more attention in speech processing. With recent state-of-the-art attention-based automatic speech recognition (ASR) structure, NAR can realize promising real-time factor (RTF) improvement with only small degradation of accuracy compared to the autoregressive (AR) models. However, the recognition inference needs to wait for the completion of a full speech utterance, which limits their applications on low latency scenarios. To address this issue, we propose a novel end-to-end streaming NAR speech recognition system by combining blockwiseattention and connectionist temporal classification with maskpredict (Mask-CTC) NAR. During inference, the input audio is separated into small blocks and then processed in a blockwise streaming way. To address the insertion and deletion error at the edge of the output of each block, we apply an overlapping decoding strategy with a dynamic mapping trick that can produce more coherent sentences. Experimental results show that the proposed method improves online ASR recognition in low latency conditions compared to vanilla Mask-CTC. Moreover, it can achieve a much faster inference speed compared to the AR attention-based models. All of our codes will be publicly available at https://github.com/espnet/espnet.

[1]  Victor O. K. Li,et al.  Non-Autoregressive Neural Machine Translation , 2017, ICLR.

[2]  Jindrich Libovický,et al.  End-to-End Non-Autoregressive Neural Machine Translation with Connectionist Temporal Classification , 2018, EMNLP.

[3]  Shinji Watanabe,et al.  Insertion-Based Modeling for End-to-End Automatic Speech Recognition , 2020, INTERSPEECH.

[4]  Qian Zhang,et al.  Transformer Transducer: One Model Unifying Streaming and Non-streaming Speech Recognition , 2020, ArXiv.

[5]  Quoc V. Le,et al.  SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition , 2019, INTERSPEECH.

[6]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[7]  Hairong Liu,et al.  Exploring neural transducers for end-to-end speech recognition , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[8]  Katrin Kirchhoff,et al.  Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment , 2020, NAACL.

[9]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[10]  Kevin Duh,et al.  ORTHROS: non-autoregressive end-to-end speech translation With dual-decoder , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Omer Levy,et al.  Mask-Predict: Parallel Decoding of Conditional Masked Language Models , 2019, EMNLP.

[12]  Navdeep Jaitly,et al.  Imputer: Sequence Modelling via Imputation and Dynamic Programming , 2020, ICML.

[13]  Gil Keren,et al.  Alignment Restricted Streaming Recurrent Neural Network Transducer , 2021, 2021 IEEE Spoken Language Technology Workshop (SLT).

[14]  Tara N. Sainath,et al.  Towards Fast and Accurate Streaming End-To-End ASR , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Tatsuya Kawahara,et al.  Enhancing Monotonic Multihead Attention for Streaming ASR , 2020, INTERSPEECH.

[16]  Shinji Watanabe,et al.  Recent Developments on Espnet Toolkit Boosted By Conformer , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Hao Zheng,et al.  AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline , 2017, 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA).

[18]  Colin Raffel,et al.  Monotonic Chunkwise Attention , 2017, ICLR.

[19]  Tetsunori Kobayashi,et al.  Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict , 2020, INTERSPEECH.

[20]  Wei Chu,et al.  CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer for Speech Recognition , 2020, ArXiv.

[21]  Tetsuji Ogawa,et al.  Improved Mask-CTC for Non-Autoregressive End-to-End ASR , 2020, ArXiv.

[22]  Shinji Watanabe,et al.  Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition , 2019, ArXiv.

[23]  Yiming Yang,et al.  Transformer-XL: Attentive Language Models beyond a Fixed-Length Context , 2019, ACL.

[24]  Yu Zhang,et al.  Conformer: Convolution-augmented Transformer for Speech Recognition , 2020, INTERSPEECH.

[25]  Yoshua Bengio,et al.  End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results , 2014, ArXiv.

[26]  Jonathan Le Roux,et al.  Streaming Automatic Speech Recognition with the Transformer Model , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27]  Xiaofei Wang,et al.  A Comparative Study on Transformer vs RNN in Speech Applications , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[28]  Kjell Schubert,et al.  RNN-T For Latency Controlled ASR With Improved Beam Search , 2019, ArXiv.

[29]  Alex Graves,et al.  Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.

[30]  Shinji Watanabe,et al.  Streaming Transformer Asr With Blockwise Synchronous Beam Search , 2020, 2021 IEEE Spoken Language Technology Workshop (SLT).

[31]  Shuai Zhang,et al.  Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition , 2020, INTERSPEECH.

[32]  Shinji Watanabe,et al.  ESPnet: End-to-End Speech Processing Toolkit , 2018, INTERSPEECH.

[33]  Paul Deléglise,et al.  Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks , 2014, LREC.

[34]  Yonghong Yan,et al.  Transformer-Based Online CTC/Attention End-To-End Speech Recognition Architecture , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[35]  Tara N. Sainath,et al.  A Comparison of End-to-End Models for Long-Form Speech Recognition , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).