A Compact Framework for Voice Conversion Using Wavenet Conditioned on Phonetic Posteriorgrams

Voice conversion can benefit from WaveNet vocoder with improvement in converted speech’s naturalness and quality. However, nowadays approaches segregate the training of conversion module and WaveNet vocoder towards different optimization objectives, which might lead to the difficulty in model tuning and coordination. In this paper, we propose a compact framework to unify the conversion and the vocoder parts. Multi-head self-attention structure and bidirectional long short-term memory (BLSTM) recurrent neural network (RNN) are employed to encode speaker independent phonetic posteriorgrams (PPGs) into an intermediate representation which is used as the condition input of WaveNet to generate target speaker’s waveform. In this way, we unify the conversion and vocoder parts into a compact system in which all parameters can be tuned simultaneously for global optimization. We compared the proposed method with the baseline system that consists of separately trained conversion module and WaveNet vocoder. Subjective evaluations show that the proposed method can achieve better results in both naturalness and speaker similarity.

[1]  Adam Coates,et al.  Deep Voice: Real-time Neural Text-to-Speech , 2017, ICML.

[2]  Junichi Yamagishi,et al.  High-Quality Nonparallel Voice Conversion Based on Cycle-Consistent Adversarial Network , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Tomoki Toda,et al.  Statistical Voice Conversion with WaveNet-Based Waveform Generation , 2017, INTERSPEECH.

[4]  Hao Wang,et al.  Phonetic posteriorgrams for many-to-one voice conversion without parallel data training , 2016, 2016 IEEE International Conference on Multimedia and Expo (ICME).

[5]  Masanori Morise,et al.  WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications , 2016, IEICE Trans. Inf. Syst..

[6]  Heiga Zen,et al.  Statistical Parametric Speech Synthesis , 2007, IEEE International Conference on Acoustics, Speech, and Signal Processing.

[7]  Jonathan G. Fiscus,et al.  Darpa Timit Acoustic-Phonetic Continuous Speech Corpus CD-ROM {TIMIT} | NIST , 1993 .

[8]  Li-Rong Dai,et al.  WaveNet Vocoder with Limited Training Data for Voice Conversion , 2018, INTERSPEECH.

[9]  Eric Moulines,et al.  Continuous probabilistic transform for voice conversion , 1998, IEEE Trans. Speech Audio Process..

[10]  Tsao Yu,et al.  Voice conversion from non-parallel corpora using variational auto-encoder , 2016 .

[11]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.

[12]  Bo Chen,et al.  High-quality Voice Conversion Using Spectrogram-Based WaveNet Vocoder , 2018, INTERSPEECH.

[13]  Carla Teixeira Lopes,et al.  TIMIT Acoustic-Phonetic Continuous Speech Corpus , 2012 .

[14]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[15]  Junichi Yamagishi,et al.  The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods , 2018, Odyssey.

[16]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[17]  Seyed Hamidreza Mohammadi,et al.  An overview of voice conversion systems , 2017, Speech Commun..

[18]  Daniel Erro,et al.  INCA Algorithm for Training Voice Conversion Systems From Nonparallel Corpora , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[19]  Haizhou Li,et al.  Exemplar-Based Sparse Representation With Residual Compensation for Voice Conversion , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[20]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[21]  Alan W. Black,et al.  The CMU Arctic speech databases , 2004, SSW.

[22]  Tomoki Toda,et al.  The Voice Conversion Challenge 2016 , 2016, INTERSPEECH.

[23]  Hideki Kawahara,et al.  STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds , 2006 .

[24]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[25]  Kun Li,et al.  Voice conversion using deep Bidirectional Long Short-Term Memory based Recurrent Neural Networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).