Multilingual Speech Translation from Efficient Finetuning of Pretrained Models

We present a simple yet effective approach to building multilingual speech-to-text (ST) translation models through efficient transfer learning from a pretrained speech encoder and text decoder. Our key finding is that minimalistic LNA (LayerNorm and Attention) finetuning can achieve zero-shot crosslingual and cross-modality transfer by updating only 10-50% of the pretrained parameters. This makes it possible to leverage large pretrained models, such as wav2vec 2.0 for acoustic modeling and mBART for multilingual text generation, at low training cost. Our approach sets a new state of the art for 36 translation directions (surpassing cascaded ST on 26 of them) on the large-scale multilingual ST benchmark CoVoST 2 (+6.4 BLEU on average for En-X directions and +6.7 BLEU for X-En directions). It also demonstrates strong zero-shot performance in a many-to-many multilingual model (+5.6 BLEU on average across 28 non-English directions), making it an appealing approach for attaining high-quality speech translation with improved parameter and data efficiency.
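To make the LNA recipe concrete, below is a minimal PyTorch sketch of this kind of selective finetuning: all pretrained parameters are frozen except those belonging to layer-normalization and attention modules. The helper `apply_lna_finetuning` and the parameter-name keywords (`layer_norm`, `self_attn`, `encoder_attn`) are illustrative assumptions; the exact names depend on the wav2vec 2.0 / mBART implementation (e.g., fairseq checkpoints), and the paper may apply different keyword sets to the encoder and decoder.

```python
import torch.nn as nn

def apply_lna_finetuning(model: nn.Module,
                         trainable_keywords=("layer_norm", "self_attn", "encoder_attn")):
    """Freeze all pretrained parameters except LayerNorm and attention weights.

    The keyword list is an assumption about parameter naming; adjust it to
    match the actual checkpoint (e.g., fairseq's wav2vec 2.0 or mBART).
    """
    # Enable gradients only for parameters whose names match an LNA keyword.
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in trainable_keywords)

    # Report the trainable fraction; the abstract cites roughly 10-50% of
    # the pretrained parameters being finetuned under this scheme.
    n_total = sum(p.numel() for p in model.parameters())
    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return n_trainable / n_total
```

In practice one would call this separately on the speech encoder and the text decoder (possibly with different keyword sets per module) before passing only the parameters with `requires_grad=True` to the optimizer.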
