Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data

Emotional voice conversion aims to convert the spectrum and prosody of speech to change its emotional pattern, while preserving the speaker identity and linguistic content. Many studies require parallel speech data between different emotional patterns, which is not practical in real life. Moreover, they often model the conversion of the fundamental frequency (F0) with a simple linear transform. As F0 is a key aspect of intonation that is hierarchical in nature, we believe that it is more appropriate to model F0 at different temporal scales using the wavelet transform. We propose a CycleGAN network that finds an optimal pseudo pair from non-parallel training data by learning forward and inverse mappings simultaneously using adversarial and cycle-consistency losses. We also study the use of the continuous wavelet transform (CWT) to decompose F0 into ten temporal scales that describe speech prosody at different time resolutions, for effective F0 conversion. Experimental results show that our proposed framework outperforms the baselines in both objective and subjective evaluations.
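To make the training objective concrete, the sketch below shows how the forward and inverse mappings can be trained jointly from non-parallel data with adversarial, cycle-consistency, and identity losses. It is a minimal PyTorch sketch, not the paper's implementation: the module names (G_xy, G_yx, D_x, D_y), the least-squares adversarial form, and the loss weights are assumptions following common CycleGAN-VC practice.

```python
import torch
import torch.nn.functional as F

def generator_loss(G_xy, G_yx, D_x, D_y, x, y,
                   lambda_cyc=10.0, lambda_id=5.0):
    """One generator update of a CycleGAN trained on non-parallel data.

    G_xy / G_yx: forward and inverse mappings between the source (x)
    and target (y) emotion domains; D_x / D_y: domain discriminators.
    x and y are batches of acoustic features (e.g. Mel-cepstral
    coefficients, or CWT coefficients of F0).  The loss weights are
    illustrative, borrowed from common CycleGAN-VC settings.
    """
    fake_y = G_xy(x)
    fake_x = G_yx(y)

    # Adversarial loss (least-squares form): converted features
    # should be scored as real (label 1) by the discriminators.
    score_y, score_x = D_y(fake_y), D_x(fake_x)
    adv = F.mse_loss(score_y, torch.ones_like(score_y)) \
        + F.mse_loss(score_x, torch.ones_like(score_x))

    # Cycle-consistency loss: x -> y -> x must return to the input,
    # which is what removes the need for parallel utterances.
    cyc = F.l1_loss(G_yx(fake_y), x) + F.l1_loss(G_xy(fake_x), y)

    # Identity-mapping loss: a sample already in the target domain
    # should pass through unchanged, which helps preserve the
    # linguistic content.
    idt = F.l1_loss(G_xy(y), y) + F.l1_loss(G_yx(x), x)

    return adv + lambda_cyc * cyc + lambda_id * idt
```

The discriminators are updated with the complementary least-squares objective (real samples toward 1, converted samples toward 0), as in standard CycleGAN training.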

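For the prosody side, the following sketch decomposes a continuous log-F0 contour into ten temporal scales with a Mexican hat mother wavelet, adjacent scales one octave apart, in the style of the CWT formulation common in prior prosody-modelling work. The 5 ms frame shift, the base scale, and the (i + 2.5)^(-5/2) normalization are assumptions drawn from that line of work, not necessarily the paper's exact configuration; F0 must first be interpolated over unvoiced frames, converted to the log scale, and normalized to zero mean and unit variance.

```python
import numpy as np

def mexican_hat(t):
    # Mexican hat (Ricker) mother wavelet.
    return (2.0 / (np.sqrt(3.0) * np.pi ** 0.25)) \
        * (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)

def cwt_decompose(lf0, num_scales=10):
    """Decompose a continuous, zero-mean/unit-variance log-F0 contour
    (one value per 5 ms frame) into `num_scales` components, with
    adjacent scales one octave apart (scale 2**(i+1) frames)."""
    n = len(lf0)
    components = np.zeros((num_scales, n))
    for i in range(num_scales):
        scale = 2.0 ** (i + 1)          # scale in frames (tau0 = one frame)
        half = int(5 * scale)           # kernel support: +/- 5 scale widths
        t = np.arange(-half, half + 1) / scale
        kernel = mexican_hat(t) / np.sqrt(scale)
        # Edge-pad so the convolution stays valid even when the kernel
        # is longer than the utterance, then crop back to n frames.
        padded = np.pad(lf0, half, mode="edge")
        comp = np.convolve(padded, kernel, mode="same")[half:half + n]
        components[i] = comp * (i + 2.5) ** -2.5
    return components

def cwt_reconstruct(components):
    # Approximate inverse: the weighted sum of all scale components.
    weights = (np.arange(components.shape[0]) + 2.5) ** -2.5
    return (weights[:, None] * components).sum(axis=0)
```

In a conversion pipeline of this kind, the ten scale components serve as the F0 features to be converted, and cwt_reconstruct maps the converted components back to a normalized log-F0 contour, which is then de-normalized with the target statistics.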