Enhancing Speech-to-Speech Translation with Multiple TTS Targets

Direct speech-to-speech translation (S2ST) models are known to suffer from data scarcity, since parallel corpora containing both source and target speech are limited. To train a direct S2ST system, prior work therefore typically uses text-to-speech (TTS) systems to synthesize target-language speech, augmenting data from speech-to-text translation (S2TT). However, how the choice of synthesized target speech affects S2ST models has received little investigation. In this work, we analyze the effect of varying the synthesized target speech for direct S2ST models. We find that simply combining target speech from different TTS systems can improve S2ST performance. Building on this finding, we propose a multi-task framework that jointly optimizes the S2ST system against multiple targets from different TTS systems. Extensive experiments demonstrate that our framework achieves consistent improvements (2.8 BLEU) over the baselines on the Fisher Spanish-English dataset.
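
The abstract does not spell out the training objective, but a minimal sketch may help make the multi-task idea concrete. In the PyTorch code below, everything is an assumption for illustration: the module choices, dimensions, TTS-system names, and the L1 stand-in loss are hypothetical, not the authors' actual architecture. The sketch shows a shared source-speech encoder receiving gradients from one decoder head per TTS system, with the per-target losses summed into a single training objective.

    # Minimal multi-task sketch: a shared encoder with one decoder head
    # per TTS system, optimized against all synthesized targets at once.
    # All names and dimensions here are hypothetical illustrations.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiTargetS2ST(nn.Module):
        def __init__(self, feat_dim=80, hidden=256,
                     tts_names=("tacotron2", "fastspeech2", "vits")):
            super().__init__()
            # shared source-speech encoder
            self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
            # one lightweight decoder head per TTS system
            self.decoders = nn.ModuleDict(
                {name: nn.Linear(hidden, feat_dim) for name in tts_names})

        def forward(self, src_speech, targets):
            # targets: dict mapping a TTS-system name to the target
            # features it synthesized (assumed time-aligned for simplicity)
            enc, _ = self.encoder(src_speech)  # (B, T, hidden)
            losses = {name: F.l1_loss(self.decoders[name](enc), tgt)
                      for name, tgt in targets.items()}
            return sum(losses.values()), losses

    # Toy usage: one batch of 80-dim source features, three TTS targets.
    model = MultiTargetS2ST()
    src = torch.randn(4, 100, 80)
    tgts = {n: torch.randn(4, 100, 80)
            for n in ("tacotron2", "fastspeech2", "vits")}
    total_loss, per_target = model(src, tgts)
    total_loss.backward()

A real S2ST decoder would generate target speech (or discrete units) autoregressively and would likely weight the per-target losses, but the core idea, a single encoder supervised by several TTS-synthesized targets, survives the simplification.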
