Paper Information - SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations

SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations

We present SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations mined from real speech of European Parliament recordings. It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech. To evaluate the quality of this parallel speech, we train bilingual speech-to-speech translation models on mined data only and establish extensive baseline results on EuroParl-ST, VoxPopuli and FLEURS test sets. Enabled by the multilinguality of SpeechMatrix, we also explore multilingual speech-to-speech translation, a topic which was addressed by few other works. We also demonstrate that model pre-training and sparse scaling using Mixture-of-Experts bring large gains to translation performance. The mined data and models are freely available.

[1] A. Conneau, et al. FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech, 2022, 2022 IEEE Spoken Language Technology Workshop (SLT).

[2] Shannon L. Spruit, et al. No Language Left Behind: Scaling Human-Centered Machine Translation, 2022, ArXiv.

[3] Holger Schwenk, et al. Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages, 2022, EMNLP.

[4] Holger Schwenk, et al. T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation, 2022, EMNLP.

[5] James R. Glass, et al. SAMU-XLSR: Semantically-Aligned Multimodal Utterance-Level Cross-Lingual Speech Representation, 2022, IEEE Journal of Selected Topics in Signal Processing.

[6] J. Dean, et al. Designing Effective Sparse Expert Models, 2022, 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[7] Yossi Adi, et al. Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation, 2022, INTERSPEECH.

[8] A. Conneau, et al. Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation, 2022, INTERSPEECH.

[9] Li Dong, et al. DeepNet: Scaling Transformers to 1,000 Layers, 2022, ArXiv.

[10] Michelle Tadmor Ramanovich, et al. CVSS Corpus and Massively Multilingual Speech-to-Speech Translation, 2022, LREC.

[11] H. Schwenk, et al. Textless Speech-to-Speech Translation on Real Data, 2021, NAACL.

[12] Juan Pino, et al. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale, 2021, INTERSPEECH.

[13] Michelle Tadmor Ramanovich, et al. Translatotron 2: High-quality direct speech-to-speech translation with voice preservation, 2021, ICML.

[14] A. Polyak, et al. Direct Speech-to-Speech Translation With Discrete Units, 2021, ACL.

[15] Marc'Aurelio Ranzato, et al. The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation, 2021, TACL.

[16] Daniel Matthew Cer, et al. Language-agnostic BERT Sentence Embedding, 2020, ACL.

[17] Juan Pino, et al. CoVoST 2 and Massively Multilingual Speech Translation, 2021, INTERSPEECH.

[18] Ruslan Salakhutdinov, et al. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, 2021, IEEE/ACM Transactions on Audio, Speech, and Language Processing.