Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high-quality speech in multiple languages. Moreover, the model can transfer voices across languages, e.g., synthesize fluent Spanish speech using an English speaker's voice, without training on any bilingual or parallel examples. Such transfer works even across distantly related languages, e.g., English and Mandarin. Two ingredients are critical to achieving this result: (1) a phonemic input representation, which encourages sharing of model capacity across languages, and (2) an adversarial loss term, which encourages the model to disentangle its representation of speaker identity (perfectly correlated with language in the training data) from the speech content. Further scaling up the model, by training on multiple speakers of each language and incorporating an autoencoding input to help stabilize attention during training, yields a model that consistently synthesizes intelligible speech for training speakers in all languages seen during training, in either a native or a foreign accent.
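
One standard way to realize the adversarial disentanglement described above is a domain-adversarial setup: a speaker classifier is attached to the text encoder's outputs through a gradient-reversal layer, so gradients that would help the classifier identify the speaker instead push the encoder toward a speaker-invariant representation. The following is a minimal, hypothetical PyTorch sketch of that idea, not the paper's actual implementation; the encoder dimension, speaker count, and reversal scale lam are illustrative placeholders.

    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        # Identity in the forward pass; negates (and scales) gradients
        # in the backward pass, so the encoder is trained to remove the
        # information the classifier needs.
        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            # Gradient w.r.t. x is reversed; lam receives no gradient.
            return -ctx.lam * grad_output, None

    class AdversarialSpeakerClassifier(nn.Module):
        # Predicts speaker identity from text-encoder outputs through
        # gradient reversal. encoder_dim, num_speakers, and lam are
        # illustrative assumptions, not values from the paper.
        def __init__(self, encoder_dim=512, num_speakers=100, lam=1.0):
            super().__init__()
            self.lam = lam
            self.classifier = nn.Linear(encoder_dim, num_speakers)

        def forward(self, encoder_outputs):
            # encoder_outputs: (batch, time, encoder_dim)
            reversed_feats = GradReverse.apply(encoder_outputs, self.lam)
            # Per-timestep speaker logits: (batch, time, num_speakers)
            return self.classifier(reversed_feats)

During training, the classifier's cross-entropy loss against the true speaker label would simply be added to the synthesis loss; the reversal layer ensures the two objectives pull the encoder in opposite directions, encouraging an encoding of the text content that carries no speaker (and hence no language-specific speaker) information.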
