Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-Resource Speech Recognition

End-to-end models have achieved impressive results on automatic speech recognition (ASR). For low-resource ASR tasks, however, the available labeled data rarely meets the demands of end-to-end models. Self-supervised acoustic pre-training has already delivered strong ASR performance, yet the limited transcriptions remain insufficient for language modeling in end-to-end systems. In this work, we fuse a pre-trained acoustic encoder (wav2vec 2.0) and a pre-trained linguistic encoder (BERT) into a single end-to-end ASR model. The fused model only needs to learn the transfer from speech to language during fine-tuning on limited labeled data. The lengths of the two modalities are matched by a monotonic attention mechanism that introduces no additional parameters, and a fully connected layer bridges the hidden representations of the two modalities. We further propose a scheduled fine-tuning strategy to preserve and exploit the text-context modeling ability of the pre-trained linguistic encoder. Experiments demonstrate the effective use of the pre-trained modules: our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
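Since the abstract only summarizes the architecture, the following is a minimal sketch of the described fusion, not the authors' implementation. It assumes the HuggingFace checkpoints facebook/wav2vec2-base and bert-base-chinese as stand-ins for the pretrained encoders, and uses a CIF-style integrate-and-fire pass as one parameter-free way to realize the monotonic length matching; the per-frame weight function (a sigmoid over the frame norm) is purely illustrative.

```python
# Illustrative sketch only: fuse a pretrained acoustic encoder with a pretrained
# linguistic encoder, matching sequence lengths with a parameter-free
# integrate-and-fire accumulator and bridging hidden sizes with one FC layer.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, BertModel


def cif_downsample(frames: torch.Tensor, target_len: int, threshold: float = 1.0):
    """Monotonically accumulate per-frame weights and emit one token-level
    vector each time the accumulator crosses `threshold`. No learned parameters."""
    # Hypothetical parameter-free weights: squash each frame's norm into (0, 1),
    # then rescale so the total weight fires exactly `target_len` times.
    weights = torch.sigmoid(frames.norm(dim=-1))                     # (B, T)
    weights = weights * target_len / weights.sum(dim=-1, keepdim=True)

    B, T, H = frames.shape
    out = frames.new_zeros(B, target_len, H)
    for b in range(B):
        acc, vec, idx = 0.0, frames.new_zeros(H), 0
        for t in range(T):
            w = weights[b, t]
            if acc + w < threshold:                                  # keep integrating
                acc, vec = acc + w, vec + w * frames[b, t]
            else:                                                    # fire a token vector
                out[b, idx] = vec + (threshold - acc) * frames[b, t]
                acc = acc + w - threshold                            # carry the remainder
                vec = acc * frames[b, t]
                idx = min(idx + 1, target_len - 1)
    return out                                                       # (B, target_len, H)


class FusedASR(nn.Module):
    def __init__(self, vocab_size: int):
        super().__init__()
        self.acoustic = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.linguistic = BertModel.from_pretrained("bert-base-chinese")
        # Fully connected bridge between acoustic and linguistic hidden spaces.
        self.bridge = nn.Linear(self.acoustic.config.hidden_size,
                                self.linguistic.config.hidden_size)
        self.classifier = nn.Linear(self.linguistic.config.hidden_size, vocab_size)

    def forward(self, speech: torch.Tensor, target_len: int):
        frames = self.acoustic(speech).last_hidden_state             # (B, T, H_a)
        tokens = cif_downsample(frames, target_len)                  # (B, U, H_a)
        hidden = self.bridge(tokens)                                 # (B, U, H_b)
        # Feed the bridged vectors to BERT in place of its word embeddings.
        out = self.linguistic(inputs_embeds=hidden).last_hidden_state
        return self.classifier(out)                                  # (B, U, vocab)
```

During fine-tuning, only the bridge (and optionally the encoders) would be updated on the limited labeled data; a scheduled fine-tuning strategy such as the one the abstract mentions could, for example, keep the linguistic encoder frozen early on so its text-context modeling is preserved.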
