Jointly Trained Conversion Model and WaveNet Vocoder for Non-Parallel Voice Conversion Using Mel-Spectrograms and Phonetic Posteriorgrams

The N10 system in the Voice Conversion Challenge 2018 (VCC 2018) has achieved high voice conversion (VC) performance in terms of speech naturalness and speaker similarity. We believe that further improvements can be gained from joint optimization (instead of separate optimization) of the conversion model and WaveNet vocoder, as well as leveraging information from the acoustic representation of the speech waveform, e.g. from Mel-spectrograms. In this paper, we propose a VC architecture to jointly train a conversion model that maps phonetic posteriorgrams (PPGs) to Mel-spectrograms and a WaveNet vocoder. The conversion model has a bottle-neck layer, whose outputs are concatenated with PPGs before being fed into the WaveNet vocoder as local conditioning. A weighted sum of a Melspectrogram prediction loss and a WaveNet loss is used as the objective function to jointly optimize parameters of the conversion model and the WaveNet vocoder. Objective and subjective evaluation results show that the proposed approach is capable of achieving significantly improved quality in voice conversion in terms of speech naturalness and speaker similarity of the converted speech for both cross-gender and intra-gender conversions.

[1]  Masanori Morise,et al.  WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications , 2016, IEICE Trans. Inf. Syst..

[2]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[3]  Junichi Yamagishi,et al.  The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods , 2018, Odyssey.

[4]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[5]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[6]  Kishore Prahallad,et al.  Spectral Mapping Using Artificial Neural Networks for Voice Conversion , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[7]  Tetsuya Takiguchi,et al.  Voice Conversion Using RNN Pre-Trained by Recurrent Temporal Restricted Boltzmann Machines , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[8]  Yonghong Yan,et al.  High Quality Voice Conversion through Phoneme-Based Linear Mapping Functions with STRAIGHT for Mandarin , 2007, Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007).

[9]  Navdeep Jaitly,et al.  Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Wei Ping,et al.  ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech , 2018, ICLR.

[11]  Xunying Liu,et al.  Voice Conversion Across Arbitrary Speakers Based on a Single Target-Speaker Utterance , 2018, INTERSPEECH.

[12]  Tomoki Toda,et al.  Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[13]  Kun Li,et al.  Voice conversion using deep Bidirectional Long Short-Term Memory based Recurrent Neural Networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Hao Wang,et al.  Phonetic posteriorgrams for many-to-one voice conversion without parallel data training , 2016, 2016 IEEE International Conference on Multimedia and Expo (ICME).

[15]  Hui Lu,et al.  A Compact Framework for Voice Conversion Using Wavenet Conditioned on Phonetic Posteriorgrams , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  Xunying Liu,et al.  The HCCL-CUHK System for the Voice Conversion Challenge 2018 , 2018, Odyssey.

[17]  Li-Rong Dai,et al.  WaveNet Vocoder with Limited Training Data for Voice Conversion , 2018, INTERSPEECH.

[18]  Eric Moulines,et al.  Continuous probabilistic transform for voice conversion , 1998, IEEE Trans. Speech Audio Process..

[19]  Bo Chen,et al.  High-quality Voice Conversion Using Spectrogram-Based WaveNet Vocoder , 2018, INTERSPEECH.

[20]  Carla Teixeira Lopes,et al.  TIMIT Acoustic-Phonetic Continuous Speech Corpus , 2012 .

[21]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[22]  Seyed Hamidreza Mohammadi,et al.  Voice conversion using deep neural networks with speaker-independent pre-training , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).