Multi-speaker Emotion Conversion via Latent Variable Regularization and a Chained Encoder-Decoder-Predictor Network

We propose a novel method for emotion conversion in speech based on a chained encoder-decoder-predictor neural network architecture. The encoder constructs a latent embedding of the fundamental frequency (F0) contour and the spectrum, which we regularize using the Large Deformation Diffeomorphic Metric Mapping (LDDMM) registration framework. The decoder uses this embedding to predict the modified F0 contour in a target emotional class. Finally, the predictor uses the original spectrum and the modified F0 contour to generate a corresponding target spectrum. Our joint objective function simultaneously optimizes the parameters of all three model blocks. We show that our method outperforms existing state-of-the-art approaches on both the saliency of emotion conversion and the quality of the resynthesized speech. In addition, the LDDMM regularization allows our model to convert phrases not seen during training, providing evidence of out-of-sample generalization.
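As a reading aid, the following is a minimal PyTorch sketch of the chained encoder-decoder-predictor and its joint training loss as described above. The 1-D convolutional blocks, layer widths, feature dimensions (n_mgc, latent_dim), and the simple squared-norm stand-in for the LDDMM-based latent regularizer are all illustrative assumptions, not the paper's actual architecture or objective.

```python
# Hypothetical sketch of the chained encoder-decoder-predictor.
# Shapes, layer choices, and the latent regularizer are assumptions
# for illustration only; the paper uses an LDDMM-based regularizer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Embeds the source F0 contour and spectrum into a latent code."""
    def __init__(self, n_mgc=23, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1 + n_mgc, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, latent_dim, kernel_size=5, padding=2),
        )

    def forward(self, f0, spec):
        # f0: (B, 1, T), spec: (B, n_mgc, T) -> (B, latent_dim, T)
        return self.net(torch.cat([f0, spec], dim=1))

class Decoder(nn.Module):
    """Maps the latent code to the target-emotion F0 contour."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(latent_dim, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, 1, kernel_size=5, padding=2),
        )

    def forward(self, z):
        return self.net(z)  # (B, 1, T)

class Predictor(nn.Module):
    """Predicts the target spectrum from the source spectrum and converted F0."""
    def __init__(self, n_mgc=23):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mgc + 1, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, n_mgc, kernel_size=5, padding=2),
        )

    def forward(self, spec, f0_converted):
        return self.net(torch.cat([spec, f0_converted], dim=1))

def training_step(enc, dec, pred, f0, spec, f0_tgt, spec_tgt, lam=0.1):
    """One joint objective over all three blocks. The squared-norm
    latent penalty below is a placeholder for the LDDMM regularizer."""
    z = enc(f0, spec)
    f0_hat = dec(z)
    spec_hat = pred(spec, f0_hat)
    recon = F.l1_loss(f0_hat, f0_tgt) + F.l1_loss(spec_hat, spec_tgt)
    latent_reg = z.pow(2).mean()  # stand-in; not the paper's regularizer
    return recon + lam * latent_reg

# Example usage with random tensors (batch of 8, 200 frames):
enc, dec, pred = Encoder(), Decoder(), Predictor()
f0 = torch.randn(8, 1, 200)
spec = torch.randn(8, 23, 200)
loss = training_step(enc, dec, pred, f0, spec, f0, spec)
loss.backward()
```

The chaining is visible in training_step: the decoder consumes the encoder's latent code, the predictor consumes the decoder's converted F0, and a single loss backpropagates through all three blocks at once, matching the abstract's description of simultaneous optimization.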
