Cross-lingual Style Transfer with Conditional Prior VAE and Style Loss