End-to-End Speech Translation with Pre-trained Models and Adapters: UPC at IWSLT 2021

This paper describes the submission of the UPC Machine Translation group to the IWSLT 2021 offline speech translation task. The task consists of building a system capable of translating English audio recordings extracted from TED talks into German text. Submitted systems can be either cascade or end-to-end and may use a custom or the given segmentation. Our submission is an end-to-end speech translation system that combines pre-trained models (Wav2Vec 2.0 and mBART) with coupling modules between the encoder and decoder, and uses an efficient fine-tuning technique that trains only 20% of its total parameters. We show that adding an Adapter to the system and pre-training it increases convergence speed and the final score, with which we achieve a BLEU score of 27.3 on the MuST-C test set. Our final model is an ensemble that obtains a BLEU score of 28.22 on the same set. Our submission also uses a custom segmentation algorithm that employs pre-trained Wav2Vec 2.0 to identify periods of untranscribable text, and brings improvements of 2.5 to 3 BLEU points on the IWSLT 2019 test set compared to the given segmentation.
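As a rough illustration of the architecture described above, the sketch below couples a pre-trained Wav2Vec 2.0 encoder to an mBART decoder through a small bottleneck adapter and a projection, and freezes the pre-trained components so that only a small fraction of the parameters is trained. It uses HuggingFace Transformers; the checkpoint names, the bottleneck size, and the choice of which parameters stay trainable are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch: Wav2Vec 2.0 encoder + adapter/projection + mBART decoder,
# with the pre-trained components frozen (assumed setup, not the submitted system).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, MBartForConditionalGeneration


class Adapter(nn.Module):
    """Bottleneck adapter: LayerNorm -> down-projection -> ReLU -> up-projection, with a residual."""

    def __init__(self, dim: int, bottleneck: int = 256):  # bottleneck size is an assumption
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(self.norm(x))))


class SpeechTranslationSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Pre-trained speech encoder and text decoder (checkpoint names are assumptions).
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-lv60")
        mbart = MBartForConditionalGeneration.from_pretrained(
            "facebook/mbart-large-50-many-to-many-mmt"
        )
        self.decoder = mbart.model.decoder
        self.lm_head = mbart.lm_head
        # Coupling modules between encoder and decoder: these are the trained parameters.
        self.adapter = Adapter(self.encoder.config.hidden_size)
        self.proj = nn.Linear(self.encoder.config.hidden_size, self.decoder.config.d_model)
        # Freeze the pre-trained components so only a small share of parameters is updated.
        for module in (self.encoder, self.decoder, self.lm_head):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, input_values, decoder_input_ids):
        speech = self.encoder(input_values).last_hidden_state      # (batch, frames, hidden)
        speech = self.proj(self.adapter(speech))                   # coupling to decoder space
        out = self.decoder(input_ids=decoder_input_ids, encoder_hidden_states=speech)
        return self.lm_head(out.last_hidden_state)                 # token logits
```

In the same spirit, the following hypothetical sketch shows one way a CTC-fine-tuned Wav2Vec 2.0 model could flag untranscribable stretches of audio: frames whose most likely label is the blank token carry no transcribable speech, and long runs of such frames are natural split points. The exact splitting heuristics of the submitted segmentation algorithm are not reproduced here.

```python
# Assumed sketch of Wav2Vec 2.0-based segmentation: mark frames the ASR model
# predicts as blank, which can then be used to split long recordings.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

CKPT = "facebook/wav2vec2-large-960h-lv60-self"  # illustrative checkpoint
processor = Wav2Vec2Processor.from_pretrained(CKPT)
model = Wav2Vec2ForCTC.from_pretrained(CKPT).eval()


def blank_frame_mask(waveform, sample_rate=16000):
    """Return a boolean mask over frames where the model predicts the CTC blank token."""
    inputs = processor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits      # (1, frames, vocab)
    predictions = logits.argmax(dim=-1).squeeze(0)      # per-frame token ids
    return predictions == model.config.pad_token_id     # blank/pad marks "no speech"
```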
