Paper information - End-to-End Speech Translation with Pre-trained Models and Adapters: UPC at IWSLT 2021

This paper describes the submission of the UPC Machine Translation group to the IWSLT 2021 offline speech translation task. The task consists of building a system that translates English audio recordings extracted from TED talks into German text. Submitted systems can be either cascade or end-to-end, and can use either a custom segmentation or the given one. Our submission is an end-to-end speech translation system that combines pre-trained models (Wav2Vec 2.0 and mBART) with coupling modules between the encoder and decoder, and uses an efficient fine-tuning technique that trains only 20% of the total parameters. We show that adding an Adapter to the system and pre-training it can increase both the convergence speed and the final result, with which we achieve a BLEU score of 27.3 on the MuST-C test set. Our final model is an ensemble that obtains a BLEU score of 28.22 on the same set. Our submission also uses a custom segmentation algorithm that employs pre-trained Wav2Vec 2.0 to identify periods of untranscribable text, and that can bring improvements of 2.5 to 3 BLEU points on the IWSLT 2019 test set compared to the result with the given segmentation.

Carlos Escolano | Marta R. Costa-jussà | Ioannis Tsiamas | Gerard I. Gállego | José A. R. Fonollosa
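
The abstract mentions adding an Adapter between the pre-trained components while training only a small share of the parameters. As a rough illustration of the general idea, the sketch below shows a bottleneck adapter in plain Python: layer normalization, a down-projection, a ReLU, an up-projection, and a residual connection. It assumes the common adapter layout from the adapter literature and is not taken from the submission's code; the class name `Adapter`, the dimensions, the omission of learned layer-norm scale and bias, and the zero initialization of the up-projection are all assumptions made for this example.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    # Normalize a single feature vector to zero mean and unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weight, bias):
    # weight has shape (out_dim, in_dim); returns weight @ x + bias.
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


class Adapter:
    """Hypothetical bottleneck adapter applied to one feature vector."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        self.b_down = [0.0] * bottleneck
        # Assumption: the up-projection starts at zero, so the adapter is
        # initially an identity mapping and does not disturb the
        # pre-trained model it is inserted into.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]
        self.b_up = [0.0] * dim

    def __call__(self, x):
        h = layer_norm(x)
        h = linear(h, self.w_down, self.b_down)   # project down
        h = [max(0.0, v) for v in h]              # ReLU
        h = linear(h, self.w_up, self.b_up)       # project back up
        return [xi + hi for xi, hi in zip(x, h)]  # residual connection

    def num_parameters(self):
        dim, bottleneck = len(self.w_up), len(self.w_down)
        return 2 * dim * bottleneck + dim + bottleneck
```

With a small bottleneck, such a module adds few parameters relative to the frozen layers around it, which is why adapters fit naturally with fine-tuning schemes that update only part of a model.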
