Neutral-to-emotional voice conversion with cross-wavelet transform F0 using generative adversarial networks

In this paper, we propose a novel neutral-to-emotional voice conversion (VC) model that learns a mapping from neutral to emotional speech from a limited amount of emotional voice data. Although conventional VC techniques have been highly successful at spectral conversion, the lack of rich representations of the fundamental frequency (F0), which explicitly carries prosody information, remains a major limiting factor for emotional VC. To overcome this limitation, we outline the practical elements of the cross-wavelet transform (XWT) and show how it can be applied to synthesize diverse representations of F0 features for emotional VC. The idea is (1) to decompose F0 into representations at different temporal levels using the continuous wavelet transform (CWT); (2) to use the XWT to combine different CWT-F0 features into interaction XWT-F0 features; and (3) to use both the CWT-F0 features and the corresponding XWT-F0 features to train the emotional VC model. Moreover, to better measure the similarity between converted and real F0 features, we apply a VA-GAN training model, which combines a variational autoencoder (VAE) with a generative adversarial network (GAN). In the VA-GAN model, the VAE learns latent representations of the high-dimensional features (CWT-F0, XWT-F0), while the feature representations learned by the GAN discriminator serve as the basis for the VAE reconstruction objective.
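Steps (1) and (2) can be illustrated with a minimal sketch. It is not the paper's implementation: the Mexican hat mother wavelet, the five dyadic scales, the synthetic log-F0 contours, and the choice to cross the neutral contour with the emotional one are all assumptions made for illustration, and the helper names (`mexican_hat`, `cwt`, `xwt`) are hypothetical. The only definition relied on is the standard cross-wavelet transform of two series, W_xy(s, t) = W_x(s, t) · conj(W_y(s, t)).

```python
import numpy as np


def mexican_hat(t, scale):
    """Mexican hat (Ricker) wavelet sampled at offsets t and dilated by scale."""
    u = t / scale
    norm = 2.0 / (np.sqrt(3.0) * np.pi ** 0.25)
    return norm * (1.0 - u ** 2) * np.exp(-0.5 * u ** 2) / np.sqrt(scale)


def cwt(signal, scales):
    """CWT by direct convolution; returns shape (len(scales), len(signal))."""
    n = len(signal)
    coeffs = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        # Truncate the kernel at 5 scale widths, never longer than the signal.
        half = int(min(5 * s, (n - 1) // 2))
        offsets = np.arange(-half, half + 1, dtype=float)
        coeffs[i] = np.convolve(signal, mexican_hat(offsets, s), mode="same")
    return coeffs


def xwt(w_x, w_y):
    """Cross-wavelet transform of two CWT coefficient arrays."""
    return w_x * np.conj(w_y)


def normalize(x):
    """Zero-mean, unit-variance normalization of a contour."""
    return (x - x.mean()) / x.std()


# Synthetic stand-ins for interpolated log-F0 contours (assumed, not real data).
n = 200
frames = np.arange(n)
neutral_f0 = normalize(5.0 + 0.05 * np.sin(2 * np.pi * frames / 100))
emotional_f0 = normalize(
    5.2
    + 0.15 * np.sin(2 * np.pi * frames / 100 + 0.3)
    + 0.05 * np.sin(2 * np.pi * frames / 20)
)

scales = 2.0 ** np.arange(1, 6)           # assumed dyadic scales: 2, 4, 8, 16, 32
w_neutral = cwt(neutral_f0, scales)       # CWT-F0 features, one row per scale
w_emotional = cwt(emotional_f0, scales)
w_cross = xwt(w_neutral, w_emotional)     # XWT-F0 interaction features

print(w_neutral.shape, w_cross.shape)
```

Because the Mexican hat wavelet is real-valued, the cross-wavelet coefficients here reduce to an element-wise product; with a complex mother wavelet such as Morlet, the same `xwt` function would also carry the phase difference between the two contours.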
