FST: the FAIR Speech Translation System for the IWSLT21 Multilingual Shared Task

In this paper, we describe our end-to-end multilingual speech translation system submitted to the IWSLT 2021 evaluation campaign for the Multilingual Speech Translation shared task. Our system is built by leveraging transfer learning across modalities, tasks, and languages. First, we leverage general-purpose multilingual modules pretrained with large amounts of unlabelled and labelled data. We further enable knowledge transfer from the text task to the speech task by training the two tasks jointly. Finally, our multilingual model is finetuned on speech translation task-specific data to achieve the best translation results. Experimental results show that our system outperforms the other reported systems, including both end-to-end and cascade-based approaches, by a large margin. In some translation directions, our speech translation results evaluated on the public Multilingual TEDx test set are even comparable to those from a strong text-to-text translation system, which uses oracle speech transcripts as input.
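To make the joint-training idea concrete, below is a minimal, hypothetical sketch of how a speech translation (ST) loss and an auxiliary text translation (MT) loss can be optimized together through shared parameters. All module names, shapes, and the loss weighting are illustrative assumptions for exposition; they are not the authors' implementation, which builds on large pretrained speech and text encoders.

```python
# Minimal sketch of joint speech/text training (illustrative only, not the authors' code).
import torch
import torch.nn as nn

VOCAB = 1000   # assumed joint subword vocabulary size
DIM = 256      # assumed model dimension

class JointS2TModel(nn.Module):
    """Speech encoder and text encoder feeding one shared output layer (simplified)."""
    def __init__(self):
        super().__init__()
        self.speech_encoder = nn.GRU(80, DIM, batch_first=True)  # stand-in for a pretrained speech encoder
        self.text_encoder = nn.Embedding(VOCAB, DIM)             # stand-in for a pretrained text encoder
        self.shared_out = nn.Linear(DIM, VOCAB)                  # shared decoder/projection (simplified)

    def forward_speech(self, feats):
        enc, _ = self.speech_encoder(feats)       # (B, T, DIM)
        return self.shared_out(enc.mean(dim=1))   # one prediction per utterance, for brevity

    def forward_text(self, tokens):
        enc = self.text_encoder(tokens).mean(dim=1)
        return self.shared_out(enc)

model = JointS2TModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()

# One joint step: the text (MT) task acts as an auxiliary objective for the speech (ST) task.
speech_feats = torch.randn(4, 50, 80)            # dummy filterbank features
text_tokens = torch.randint(0, VOCAB, (4, 20))   # dummy source token ids
targets = torch.randint(0, VOCAB, (4,))          # dummy target labels

st_loss = ce(model.forward_speech(speech_feats), targets)
mt_loss = ce(model.forward_text(text_tokens), targets)
loss = st_loss + 0.5 * mt_loss                   # assumed weighting, not specified in the abstract
loss.backward()
opt.step()
```

Because the two tasks share the output layer (and, in the full system, much more of the network), gradients from the easier text task regularize and transfer knowledge to the speech task before the final speech-translation finetuning stage.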
