Building Multilingual TTS Using Cross-Lingual Voice Conversion

In this paper we propose a new cross-lingual voice conversion (VC) approach that generates all speech parameters (mel-cepstral coefficients (MCEP), log F0 (LF0), and band aperiodicities (BAP)) from a single DNN model, using Phonetic PosteriorGrams (PPGs) extracted from the input speech by several ASR acoustic models. Using the proposed VC method, we explored three approaches to building a multilingual TTS system without recording a multilingual speech corpus. A listening test was carried out to evaluate both speech quality (naturalness) and voice similarity between the converted speech and the target speech. The results show that Approach 1 achieved the highest naturalness (3.28 MOS on a 5-point scale) and similarity (2.77 MOS).
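The core mapping described above, from speaker-independent PPG frames to the full set of acoustic parameters, can be sketched as a single feed-forward network. The following is a minimal NumPy sketch, not the paper's actual model: the layer sizes, the PPG dimensionality (218 senone classes), and the output split (60 MCEP + 1 LF0 + 5 BAP) are illustrative assumptions, and the weights are random rather than trained.

```python
import numpy as np

# Hypothetical dimensions (not taken from the paper):
# 218 senone posteriors in; 60 MCEP + 1 LF0 + 5 BAP = 66 parameters out.
PPG_DIM, HIDDEN, OUT_DIM = 218, 256, 66

rng = np.random.default_rng(0)
W1 = rng.standard_normal((PPG_DIM, HIDDEN)) * 0.01
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, OUT_DIM)) * 0.01
b2 = np.zeros(OUT_DIM)

def vc_forward(ppg_frames: np.ndarray) -> np.ndarray:
    """Map PPG frames (T, PPG_DIM) to acoustic frames (T, OUT_DIM)."""
    h = np.tanh(ppg_frames @ W1 + b1)   # shared hidden representation
    return h @ W2 + b2                  # linear output: [MCEP | LF0 | BAP]

# One utterance: 100 frames of phoneme posteriors (each row sums to 1).
logits = rng.standard_normal((100, PPG_DIM))
ppg = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

feats = vc_forward(ppg)
mcep, lf0, bap = feats[:, :60], feats[:, 60:61], feats[:, 61:]
print(mcep.shape, lf0.shape, bap.shape)   # (100, 60) (100, 1) (100, 5)
```

Because the PPG is (ideally) speaker- and language-independent, training this one network on the target speaker's data lets it render speech in the target voice for any source language covered by the ASR acoustic models.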
