Multilingual Speech Translation from Efficient Finetuning of Pretrained Models

We present a simple yet effective approach to building multilingual speech-to-text (ST) translation models through efficient transfer learning from a pretrained speech encoder and text decoder. Our key finding is that minimalistic LNA (LayerNorm and Attention) finetuning can achieve zero-shot crosslingual and cross-modality transfer by updating only 10-50% of the pretrained parameters. This makes it possible to leverage large pretrained models, such as wav2vec 2.0 for acoustic modeling and mBART for multilingual text generation, at low training cost. Our approach sets a new state of the art for 36 translation directions (surpassing cascaded ST on 26 of them) on the large-scale multilingual ST benchmark CoVoST 2 (+6.4 BLEU on average for En-X directions and +6.7 BLEU for X-En directions). It also demonstrates strong zero-shot performance in a many-to-many multilingual model (+5.6 BLEU on average across 28 non-English directions), making it an appealing approach for attaining high-quality speech translation with improved parameter and data efficiency.
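To make the LNA recipe concrete, below is a minimal PyTorch sketch of this kind of selective finetuning: all pretrained parameters are frozen except those belonging to layer-normalization and attention modules. The helper `apply_lna_finetuning` and the parameter-name keywords (`layer_norm`, `self_attn`, `encoder_attn`) are illustrative assumptions; the exact names depend on the wav2vec 2.0 / mBART implementation (e.g., fairseq checkpoints), and the paper may apply different keyword sets to the encoder and decoder.

```python
import torch.nn as nn

def apply_lna_finetuning(model: nn.Module,
                         trainable_keywords=("layer_norm", "self_attn", "encoder_attn")):
    """Freeze all pretrained parameters except LayerNorm and attention weights.

    The keyword list is an assumption about parameter naming; adjust it to
    match the actual checkpoint (e.g., fairseq's wav2vec 2.0 or mBART).
    """
    # Enable gradients only for parameters whose names match an LNA keyword.
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in trainable_keywords)

    # Report the trainable fraction; the abstract cites roughly 10-50% of
    # the pretrained parameters being finetuned under this scheme.
    n_total = sum(p.numel() for p in model.parameters())
    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return n_trainable / n_total
```

In practice one would call this separately on the speech encoder and the text decoder (possibly with different keyword sets per module) before passing only the parameters with `requires_grad=True` to the optimizer.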
