Fine-grained style modelling and transfer in text-to-speech synthesis via content-style disentanglement

This paper presents a novel neural model for fine-grained style modelling and transfer in expressive text-to-speech (TTS) synthesis. By applying collaborative and adversarial learning strategies with carefully designed loss functions, the proposed model performs effective phoneme-level disentanglement of the content and style factors of speech. Style transfer is achieved by combining the style embedding extracted from a reference utterance with the phoneme embedding derived from the source text. Objective evaluations show that the synthesized speech preserves the intended content and carries prosody similar to that of the reference speech. Subjective evaluations show that the proposed model outperforms other fine-grained style transfer TTS models.
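To make the disentangle-then-combine idea concrete, below is a minimal PyTorch sketch of one common way to realize it: a collaborative classifier that encourages the style embedding to carry style information, an adversarial classifier behind a gradient reversal layer that discourages the content embedding from carrying it, and a transfer step that concatenates a reference style vector with each phoneme embedding before decoding. All module names, dimensions, and the specific use of gradient reversal are illustrative assumptions, not the paper's exact architecture or loss functions; the sketch also collapses style to a single utterance-level vector, whereas the paper models it at phoneme level.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in the
    backward pass, so the content branch learns to fool the style classifier."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None


class ContentStyleDisentangler(nn.Module):
    """Hypothetical module names and sizes throughout."""

    def __init__(self, n_phonemes=70, d_content=128, d_style=16, n_styles=4):
        super().__init__()
        self.content_emb = nn.Embedding(n_phonemes, d_content)
        # Style encoder: reduces an 80-band reference mel-spectrogram
        # to a single style vector (a simplification of the paper's
        # phoneme-level style extraction).
        self.style_enc = nn.GRU(input_size=80, hidden_size=d_style,
                                batch_first=True)
        # Collaborative head: the style embedding SHOULD predict the style label.
        self.style_clf = nn.Linear(d_style, n_styles)
        # Adversarial head: the content embedding should NOT predict it.
        self.adv_clf = nn.Linear(d_content, n_styles)

    def forward(self, phoneme_ids, ref_mel, lamb=1.0):
        c = self.content_emb(phoneme_ids)          # (B, T_phone, d_content)
        _, h = self.style_enc(ref_mel)             # h: (1, B, d_style)
        s = h.squeeze(0)                           # (B, d_style)
        collab_logits = self.style_clf(s)
        adv_logits = self.adv_clf(GradReverse.apply(c.mean(dim=1), lamb))
        # Style transfer: broadcast the reference style vector and concatenate
        # it with every phoneme embedding before decoding to speech.
        s_seq = s.unsqueeze(1).expand(-1, c.size(1), -1)
        decoder_input = torch.cat([c, s_seq], dim=-1)
        return decoder_input, collab_logits, adv_logits


# Toy usage: two utterances, 12 phonemes each, 200-frame reference mels.
model = ContentStyleDisentangler()
phones = torch.randint(0, 70, (2, 12))
ref = torch.randn(2, 200, 80)
dec_in, collab, adv = model(phones, ref)
style_label = torch.tensor([1, 3])
loss = nn.functional.cross_entropy(collab, style_label) \
     + nn.functional.cross_entropy(adv, style_label)
loss.backward()  # the reversed gradient pushes content features to be style-agnostic
```

In this setup the two classification losses play opposite roles on the same labels: the collaborative loss pulls style information into the style branch, while the gradient reversal turns the adversarial loss into a penalty on any style information leaking into the content branch.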
