Cross-lingual Style Transfer with Conditional Prior VAE and Style Loss

In this work, we improve the style representation for cross-lingual style transfer. Specifically, we improve the Spanish rendering of four styles, Newscaster, DJ, Excited, and Disappointed, whilst maintaining a single speaker identity for which only English samples are available. This is achieved using the Learned Conditional Prior VAE (LCPVAE), a hierarchical Variational Autoencoder (VAE) approach. A secondary VAE, conditioned on one-hot-encoded style information, is introduced to structure the embedding space of the primary VAE, placing utterances of the same style in similar regions of the latent space irrespective of language. We also experiment with extending this model by incorporating a style loss. Subjective evaluations of style similarity with native Spanish speakers show an average relative improvement over the baseline of 3.5%, statistically significant (p < 0.01) across all four styles. Interestingly, the more expressive styles achieve a higher relative improvement of 4.4%, compared to 2.6% for styles closer to neutral speech. These gains come whilst maintaining speaker similarity and in-lingual performance in all styles. Accent performance is maintained in three of the four styles, the exception being Excited, while naturalness is maintained in the Newscaster and Disappointed styles.
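The abstract describes the LCPVAE mechanism only at a high level. Below is a minimal PyTorch sketch of the core idea, assuming per-utterance acoustic summaries as the VAE input. All module names, layer sizes, and the concrete form of the style loss are illustrative assumptions rather than the paper's implementation; in particular, the secondary VAE is reduced here to a learned style-conditional prior network, and the style loss is realised as a latent-space style classifier.

    # Minimal sketch of a learned conditional prior for a VAE latent space,
    # plus an auxiliary style loss. Names and dimensions are hypothetical.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GaussianEncoder(nn.Module):
        # Maps an input vector to the mean/log-variance of a diagonal Gaussian.
        def __init__(self, in_dim, z_dim, hidden=256):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
            self.mu = nn.Linear(hidden, z_dim)
            self.logvar = nn.Linear(hidden, z_dim)

        def forward(self, x):
            h = self.net(x)
            return self.mu(h), self.logvar(h)

    def kl_between_gaussians(mu_q, lv_q, mu_p, lv_p):
        # KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over latent dims.
        return 0.5 * torch.sum(
            lv_p - lv_q + (lv_q.exp() + (mu_q - mu_p) ** 2) / lv_p.exp() - 1.0,
            dim=-1,
        )

    class LCPVAESketch(nn.Module):
        def __init__(self, feat_dim=80, z_dim=64, num_styles=4):
            super().__init__()
            self.primary = GaussianEncoder(feat_dim, z_dim)       # q(z | x)
            self.cond_prior = GaussianEncoder(num_styles, z_dim)  # learned p(z | style)
            self.style_clf = nn.Linear(z_dim, num_styles)         # auxiliary style loss

        def forward(self, x, style_onehot):
            mu_q, lv_q = self.primary(x)
            mu_p, lv_p = self.cond_prior(style_onehot)
            # Reparameterisation trick: sample z from the primary posterior.
            z = mu_q + torch.randn_like(mu_q) * (0.5 * lv_q).exp()
            # KL to the style-conditional prior groups same-style utterances,
            # irrespective of language, in the same region of latent space.
            kl = kl_between_gaussians(mu_q, lv_q, mu_p, lv_p).mean()
            # Auxiliary style loss: encourage z to remain style-discriminative.
            style_loss = F.cross_entropy(self.style_clf(z), style_onehot.argmax(-1))
            return z, kl, style_loss

In a full TTS system, z would condition the acoustic decoder, and the KL and style terms would be added, with suitable weights, to the reconstruction loss. A usage example under the same assumptions:

    model = LCPVAESketch()
    x = torch.randn(8, 80)                                # dummy acoustic summaries
    s = F.one_hot(torch.randint(0, 4, (8,)), 4).float()   # one-hot style labels
    z, kl, style_loss = model(x, s)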
