Improving Hybrid CTC/Attention End-to-End Speech Recognition with Pretrained Acoustic and Language Models

Recently, self-supervised pretraining has achieved impressive results in end-to-end (E2E) automatic speech recognition (ASR). However, the dominant sequence-to-sequence (S2S) E2E model still struggles to fully exploit self-supervised pretraining methods, because its decoder is conditioned on acoustic representations and thus cannot be pretrained separately. In this paper, we propose a pretrained Transformer (Preformer) S2S ASR architecture based on hybrid CTC/attention E2E models to fully utilize pretrained acoustic models (AMs) and language models (LMs). In our framework, the encoder is initialized with a pretrained AM (wav2vec 2.0), and the Preformer leverages CTC as an auxiliary task during both training and inference. Furthermore, we design a one-cross decoder (OCD), which relaxes the dependence on acoustic representations so that it can be initialized with a pretrained LM (DistilGPT2). Experiments are conducted on the AISHELL-1 corpus, achieving a 4.6% character error rate (CER) on the test set. Compared with our vanilla hybrid CTC/attention Transformer baseline, the proposed CTC/attention-based Preformer yields a 27% relative CER reduction. To the best of our knowledge, this is the first work to utilize both a pretrained AM and a pretrained LM in an S2S ASR system.
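The hybrid CTC/attention framework referenced above interpolates the CTC and attention objectives with a weight λ during training, and combines their log-probabilities analogously during joint decoding. A minimal numerical sketch in plain Python (the weight value and loss numbers are illustrative assumptions, not the paper's implementation):

```python
def hybrid_loss(ctc_loss: float, att_loss: float, lam: float = 0.3) -> float:
    """Multi-task training objective: L = lam * L_ctc + (1 - lam) * L_att.

    lam (the CTC weight) is a hyperparameter; 0.3 is a common choice
    in hybrid CTC/attention systems, assumed here for illustration.
    """
    return lam * ctc_loss + (1.0 - lam) * att_loss


def joint_score(log_p_ctc: float, log_p_att: float, lam: float = 0.3) -> float:
    """Joint CTC/attention decoding score for a partial hypothesis:
    the same interpolation applied to per-hypothesis log-probabilities."""
    return lam * log_p_ctc + (1.0 - lam) * log_p_att


# Illustrative values only: 0.3 * 2.0 + 0.7 * 1.0 ≈ 1.3
example = hybrid_loss(ctc_loss=2.0, att_loss=1.0, lam=0.3)
```

In practice, `ctc_loss` and `att_loss` would come from a CTC layer on the encoder output and the cross-entropy of the attention decoder, respectively; the scalar interpolation shown here is the part the abstract describes.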
