SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations

We present SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations mined from real speech in European Parliament recordings. It contains speech alignments in 136 language pairs, totaling 418 thousand hours of speech. To evaluate the quality of this parallel speech, we train bilingual speech-to-speech translation models on mined data only and establish extensive baseline results on the EuroParl-ST, VoxPopuli and FLEURS test sets. Enabled by the multilinguality of SpeechMatrix, we also explore multilingual speech-to-speech translation, a topic that few other works have addressed. We further demonstrate that model pre-training and sparse scaling using Mixture-of-Experts bring large gains in translation performance. The mined data and models are freely available.
