Multilingual Speech Translation with Efficient Finetuning of Pretrained Models

We present a simple yet effective approach to building multilingual speech-to-text (ST) translation through efficient transfer learning from a pretrained speech encoder and a pretrained text decoder. Our key finding is that a minimalistic LNA (LayerNorm and Attention) finetuning strategy achieves cross-lingual and cross-modality transfer while updating less than 10% of the pretrained parameters, which makes it possible to leverage large pretrained models at low training cost. Using wav2vec 2.0 for acoustic modeling and mBART for multilingual text generation, our approach sets a new state of the art for 32 translation directions (surpassing cascaded ST for 19 of them) on the multilingual ST benchmark CoVoST 2 (+4.4 BLEU on average over 13 En-X directions and +4.3 BLEU on average over 19 X-En directions). The approach also performs strongly in both zero-shot cross-lingual transfer and zero-shot pairwise translation within a many-to-many multilingual model, making it appealing for attaining high-quality speech translation with low training cost and improved data efficiency.
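To make the LNA recipe concrete, the sketch below shows one way to select the trainable parameter subset in PyTorch: freeze the whole pretrained encoder-decoder, then re-enable gradients only for LayerNorm and attention parameters. This is a minimal illustration, not the paper's released implementation; in particular, matching attention submodules by the substring "attn" in their names is an assumption about the model's naming convention (it holds for many Transformer implementations, including fairseq-style ones).

```python
import torch
import torch.nn as nn


def apply_lna_finetuning(model: nn.Module) -> list:
    """Freeze a pretrained encoder-decoder model and re-enable gradients only
    for LayerNorm and attention parameters (the LNA strategy described above).
    Returns the list of trainable parameters to hand to the optimizer.
    """
    # Start from a fully frozen model.
    for param in model.parameters():
        param.requires_grad = False

    trainable = []
    for name, module in model.named_modules():
        is_layer_norm = isinstance(module, nn.LayerNorm)
        # Assumption: attention blocks (and their projection layers) carry
        # "attn" in their module names, e.g. "encoder.layers.0.self_attn.q_proj".
        is_attention = isinstance(module, nn.MultiheadAttention) or "attn" in name
        if is_layer_norm or is_attention:
            # recurse=False so each parameter is collected exactly once,
            # by the module that directly owns it.
            for param in module.parameters(recurse=False):
                param.requires_grad = True
                trainable.append(param)
    return trainable


# Usage sketch: `model` would be the pretrained wav2vec 2.0 encoder plus mBART
# decoder; only the LNA parameters are passed to the optimizer.
# trainable_params = apply_lna_finetuning(model)
# optimizer = torch.optim.Adam(trainable_params, lr=1e-4)
```

Because only a small fraction of parameters receives gradients, optimizer state and gradient memory shrink accordingly, which is what keeps the training cost low relative to full finetuning.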
