Improving Hybrid CTC/Attention End-to-End Speech Recognition with Pretrained Acoustic and Language Models

Recently, self-supervised pretraining has achieved impressive results in end-to-end (E2E) automatic speech recognition (ASR). However, the dominant sequence-to-sequence (S2S) E2E model still struggles to fully exploit self-supervised pretraining methods, because its decoder is conditioned on acoustic representations and thus cannot be pretrained separately. In this paper, we propose a pretrained Transformer (Preformer) S2S ASR architecture based on hybrid CTC/attention E2E models to fully utilize pretrained acoustic models (AMs) and language models (LMs). In our framework, the encoder is initialized with a pretrained AM (wav2vec 2.0), and the Preformer leverages CTC as an auxiliary task during both training and inference. Furthermore, we design a one-cross decoder (OCD), which relaxes the dependence on acoustic representations so that it can be initialized with a pretrained LM (DistilGPT2). Experiments are conducted on the AISHELL-1 corpus, achieving a 4.6% character error rate (CER) on the test set. Compared with our vanilla hybrid CTC/attention Transformer baseline, the proposed CTC/attention-based Preformer yields a 27% relative CER reduction. To the best of our knowledge, this is the first work to utilize both a pretrained AM and a pretrained LM in an S2S ASR system.
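The hybrid CTC/attention framework referenced above interpolates the CTC and attention objectives with a weight λ during training, and combines their log-probabilities analogously during joint decoding. A minimal numerical sketch in plain Python (the weight value and loss numbers are illustrative assumptions, not the paper's implementation):

```python
def hybrid_loss(ctc_loss: float, att_loss: float, lam: float = 0.3) -> float:
    """Multi-task training objective: L = lam * L_ctc + (1 - lam) * L_att.

    lam (the CTC weight) is a hyperparameter; 0.3 is a common choice
    in hybrid CTC/attention systems, assumed here for illustration.
    """
    return lam * ctc_loss + (1.0 - lam) * att_loss


def joint_score(log_p_ctc: float, log_p_att: float, lam: float = 0.3) -> float:
    """Joint CTC/attention decoding score for a partial hypothesis:
    the same interpolation applied to per-hypothesis log-probabilities."""
    return lam * log_p_ctc + (1.0 - lam) * log_p_att


# Illustrative values only: 0.3 * 2.0 + 0.7 * 1.0 ≈ 1.3
example = hybrid_loss(ctc_loss=2.0, att_loss=1.0, lam=0.3)
```

In practice, `ctc_loss` and `att_loss` would come from a CTC layer on the encoder output and the cross-entropy of the attention decoder, respectively; the scalar interpolation shown here is the part the abstract describes.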
