CVSS Corpus and Massively Multilingual Speech-to-Speech Translation

We introduce CVSS, a massively multilingual-to-English speech-to-speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice (Ardila et al., 2020) speech corpus and the CoVoST 2 (Wang et al., 2021b) speech-to-text translation (ST) corpus, by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems. Two versions of translation speech are provided: 1) CVSS-C, in which all translation speech is in a single high-quality canonical voice; 2) CVSS-T, in which the translation speech is in voices transferred from the corresponding source speech. In addition, CVSS provides normalized translation text that matches the pronunciation in the translation speech. On each version of CVSS, we built baseline multilingual direct S2ST models and cascade S2ST models, verifying the effectiveness of the corpus. To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2 that outperforms the previous state of the art trained on the corpus without extra data by 5.8 BLEU. Nevertheless, the performance of the direct S2ST models approaches that of the strong cascade baselines when trained from scratch, and is within 0.1 or 0.7 BLEU on ASR-transcribed translation when initialized from matching ST models.
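To make the evaluation protocol concrete, below is a minimal sketch of ASR-BLEU scoring as described above: transcribe the model's translated speech with an off-the-shelf English ASR model, then score the transcripts against the normalized reference translations with sacreBLEU. The manifest layout, column names, and ASR checkpoint are illustrative assumptions, not the pipeline used in the paper.

```python
# Sketch of ASR-BLEU evaluation for S2ST output (illustrative, not the authors' code).
import csv

import sacrebleu
from transformers import pipeline

# Any public English ASR checkpoint can stand in here; this one is just an example.
asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")


def asr_bleu(manifest_tsv: str) -> float:
    """Corpus BLEU between ASR transcripts of translated speech and reference text."""
    hyps, refs = [], []
    with open(manifest_tsv, newline="", encoding="utf-8") as f:
        # Assumed columns: "audio" (path to translated speech), "reference"
        # (normalized translation text, as provided by CVSS).
        for row in csv.DictReader(f, delimiter="\t"):
            hyps.append(asr(row["audio"])["text"].lower())
            refs.append(row["reference"].lower())
    return sacrebleu.corpus_bleu(hyps, [refs]).score


if __name__ == "__main__":
    # Hypothetical manifest file name, for illustration only.
    print(f"ASR-BLEU: {asr_bleu('cvss_c_dev.tsv'):.1f}")
```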

[1] Holger Schwenk, et al. Multimodal and Multilingual Embeddings for Large-Scale Speech Mining, 2021, NeurIPS.

[2] Shankar Kumar, et al. Normalization of non-standard words, 2001, Comput. Speech Lang.

[3] Tara N. Sainath, et al. Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling, 2019, ArXiv.

[4] Patrick Nguyen, et al. Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis, 2018, NeurIPS.

[5] Silvia Bernardini, et al. An Approach to Corpus-Based Interpreting Studies: Developing EPIC (European Parliament Interpreting Corpus), 2007.

[6] Dong Wang, et al. CN-Celeb: A Challenging Chinese Speaker Recognition Dataset, 2019, ICASSP 2020.

[7] Heiga Zen, et al. Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning, 2019, INTERSPEECH.

[8] Tie-Yan Liu, et al. UWSpeech: Speech to Speech Translation for Unwritten Languages, 2020, AAAI.

[9] Armand Joulin, et al. Libri-Light: A Benchmark for ASR with Limited or No Supervision, 2020, ICASSP 2020.

[10] Marco Gaido, et al. Is "moby dick" a Whale or a Bird? Named Entities and Terminology in Speech Translation, 2021, EMNLP.

[11] Heiga Zen, et al. PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS, 2021, Interspeech 2021.

[12] Donald S. Williamson, et al. Optimally Encoding Inductive Biases into the Transformer Improves End-to-End Speech Translation, 2021, Interspeech 2021.

[13] Matt Post, et al. A Call for Clarity in Reporting BLEU Scores, 2018, WMT.

[14] Eiichiro Sumita, et al. Creating corpora for speech-to-speech translation, 2003, INTERSPEECH.

[15] Kenneth Heafield, et al. Direct simultaneous speech to speech translation, 2021, ArXiv.

[16] Eiichiro Sumita, et al. Comparative study on corpora for speech translation, 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[17] Heiga Zen, et al. Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling, 2020, ArXiv.

[18] Tomoki Toda, et al. Constructing a speech translation system using simultaneous interpretation data, 2013, IWSLT.

[19] Yuan Cao, et al. Leveraging Weakly Supervised Data to Improve End-to-end Speech-to-text Translation, 2018, ICASSP 2019.

[20] David Miller, et al. The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text, 2004, LREC.

[21] Juan Pino, et al. CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus, 2020, LREC.

[22] Melvin Johnson, et al. Direct speech-to-speech translation with a sequence-to-sequence model, 2019, INTERSPEECH.

[23] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[24] Alex Waibel, et al. JANUS: a speech-to-speech translation system using connectionist and symbolic processing strategies, 1991, ICASSP 1991.

[25] Yu Zhang, et al. Conformer: Convolution-augmented Transformer for Speech Recognition, 2020, INTERSPEECH.

[26] Satoshi Nakamura, et al. Speech-to-Speech Translation Between Untranscribed Unknown Languages, 2019, ASRU 2019.

[27] Juan Pino, et al. Textless Speech-to-Speech Translation on Real Data, 2021, ArXiv.

[28] Quan Wang, et al. Dr-Vectors: Decision Residual Networks and an Improved Loss for Speaker Recognition, 2021, Interspeech 2021.

[29] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[30] Quoc V. Le, et al. Improved Noisy Student Training for Automatic Speech Recognition, 2020, INTERSPEECH.

[31] Heiga Zen, et al. LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech, 2019, INTERSPEECH.

[32] Christopher Cieri, et al. Resources for new research directions in speaker recognition: the mixer 3, 4 and 5 corpora, 2007, INTERSPEECH.

[33] Yun Tang, et al. Multilingual Speech Translation with Efficient Finetuning of Pretrained Models, 2020.

[34] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[35] Taku Kudo, et al. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, 2018, EMNLP.

[36] Shigeki Matsubara, et al. CIAIR Simultaneous Interpretation Corpus, 2004.

[37] Matt Post, et al. Improved speech-to-text translation with the Fisher and Callhome Spanish-English speech translation corpus, 2013, IWSLT.

[38] Ye Jia, et al. Translatotron 2: Robust direct speech-to-speech translation, 2021, ArXiv.

[39] Adam Polyak, et al. Direct speech-to-speech translation with discrete units, 2021, ArXiv.

[40] Laurent Besacier, et al. MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible, 2019, LREC.

[41] Navdeep Jaitly, et al. Sequence-to-Sequence Models Can Directly Translate Foreign Speech, 2017, INTERSPEECH.

[42] Emmanuel Dupoux, et al. VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation, 2021, ACL.

[43] Juan Pino, et al. CoVoST 2 and Massively Multilingual Speech Translation, 2021, Interspeech.

[44] Erich Elsen, et al. Efficient Neural Audio Synthesis, 2018, ICML.

[45] Christopher Cieri, et al. Speaker Recognition: Building the Mixer 4 and 5 Corpora, 2008, LREC.

[46] Satoshi Nakamura, et al. Transformer-Based Direct Speech-To-Speech Translation with Transcoder, 2021, SLT 2021.

[47] Richard Sproat, et al. The Kestrel TTS text normalization system, 2014, Natural Language Engineering.

[48] Quan Wang, et al. Generalized End-to-End Loss for Speaker Verification, 2017, ICASSP 2018.

[49] He He, et al. Interpretese vs. Translationese: The Uniqueness of Human Strategies in Simultaneous Interpretation, 2016, NAACL.

[50] Juan Pino, et al. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale, 2021, INTERSPEECH.

[51] Tomoki Toda, et al. Collection of a Simultaneous Translation Corpus for Comparative Analysis, 2014, LREC.

[52] Adam Lopez, et al. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation, 2018, NAACL.

[53] Sanjeev Khudanpur, et al. Librispeech: An ASR corpus based on public domain audio books, 2015, ICASSP 2015.