Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequence of the target-language speech, using both the source-language speech and the target-language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied to zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation. Experimental results show that it can generate high-quality speech in the target language from just a single source-language speech utterance used as a prompt, while preserving the unseen speaker's voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates the foreign accent problem, which can be controlled with a language ID. Audio samples are available at https://aka.ms/vallex.
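
The sketch below illustrates, under stated assumptions, how such a prompted autoregressive stage could be laid out; it is not the authors' implementation. All names (ToyARDecoder, LANG_IDS, the dummy token ids and dimensions) are hypothetical placeholders. Only the prompt layout follows the description above: source-language speech tokens and target-language text serve as prompts, and a language-ID embedding is added to steer the accent.

```python
# Minimal sketch (not the authors' code) of a VALL-E-X-style autoregressive
# stage for zero-shot cross-lingual synthesis. Names and sizes are hypothetical.
import torch
import torch.nn as nn

LANG_IDS = {"zh": 0, "en": 1}  # hypothetical language-ID vocabulary


class ToyARDecoder(nn.Module):
    """Stand-in decoder-only LM over phoneme and acoustic (codec) tokens."""

    def __init__(self, n_phone=512, n_acoustic=1024, n_lang=2, d=256):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phone, d)
        self.acoustic_emb = nn.Embedding(n_acoustic, d)
        self.lang_emb = nn.Embedding(n_lang, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, n_acoustic)

    def forward(self, src_phones, tgt_phones, src_lang, tgt_lang,
                src_acoustic, tgt_acoustic):
        # Phoneme prompt: source-text phonemes then target-text phonemes,
        # each shifted by the embedding of its language ID.
        phones = torch.cat([
            self.phone_emb(src_phones) + self.lang_emb(src_lang),
            self.phone_emb(tgt_phones) + self.lang_emb(tgt_lang),
        ], dim=1)
        # Acoustic prompt: codec tokens of the source utterance, followed by
        # the target-language tokens generated so far.
        acoustics = self.acoustic_emb(torch.cat([src_acoustic, tgt_acoustic], dim=1))
        x = torch.cat([phones, acoustics], dim=1)
        # Causal mask so each position attends only to earlier positions.
        causal = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf")), diagonal=1)
        h = self.backbone(x, mask=causal)
        return self.head(h[:, -1])  # logits for the next target acoustic token


# Example call with dummy token ids (batch size 1):
model = ToyARDecoder()
src_phones = torch.randint(0, 512, (1, 20))      # phonemized source transcript
tgt_phones = torch.randint(0, 512, (1, 25))      # phonemized target-language text
src_acoustic = torch.randint(0, 1024, (1, 150))  # codec tokens of the source speech prompt
tgt_acoustic = torch.randint(0, 1024, (1, 10))   # target tokens generated so far
src_lang = torch.full((1, 20), LANG_IDS["zh"])
tgt_lang = torch.full((1, 25), LANG_IDS["en"])
logits = model(src_phones, tgt_phones, src_lang, tgt_lang, src_acoustic, tgt_acoustic)
next_token = logits.argmax(-1)  # greedy choice of the next acoustic token
```

In practice the predicted acoustic tokens would be decoded back to a waveform by a neural codec decoder, and the remaining codebook layers would be filled in by a non-autoregressive stage, but those components are outside the scope of this sketch.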
