Improving transfer of expressivity for end-to-end multispeaker text-to-speech synthesis

The main goal of this work is to generate expressive speech in the voices of speakers for whom no expressive speech data is available. The presented approach conditions Tacotron 2 speech synthesis on latent representations extracted from the text, the speaker identity, and a reference expressive Mel spectrogram. We propose to use a multiclass N-pair loss in end-to-end multispeaker expressive text-to-speech (TTS) synthesis to improve the transfer of expressivity to the target speaker's voice. We jointly train the end-to-end (E2E) TTS model with the multiclass N-pair loss so that it learns to discriminate between emotions; this augmentation of the training objective yields a better-structured latent representation of the emotions. We experimented with two neural network architectures for the expressivity encoder, namely global style tokens (GST) and a variational autoencoder (VAE). Expressivity is transferred by conditioning synthesis on the mean of the latent representations extracted by the expressivity encoder for each emotion. The obtained results show that adding multiclass N-pair-loss-based deep metric learning to the training process improves expressivity in the desired speaker's voice.
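To make the two key ingredients concrete, the sketch below illustrates (a) the multiclass N-pair loss of Sohn (2016) applied to latent vectors produced by the expressivity encoder, and (b) the per-emotion mean latent used to condition synthesis. This is a minimal PyTorch illustration under stated assumptions, not the authors' implementation: the function names, tensor shapes, dot-product similarity, and the batch construction (one anchor/positive pair per emotion class) are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def multiclass_n_pair_loss(anchors: torch.Tensor, positives: torch.Tensor) -> torch.Tensor:
    """Multiclass N-pair loss (Sohn, 2016) over expressivity latents.

    anchors   -- (N, D) latent vectors, one per emotion class.
    positives -- (N, D) latents of the same N classes, row-aligned with
                 `anchors` but extracted from different utterances.
    """
    # Pairwise similarities: logits[i, j] = <anchor_i, positive_j>.
    logits = anchors @ positives.t()                          # (N, N)
    # log(1 + sum_{j != i} exp(a_i.p_j - a_i.p_i)) is exactly softmax
    # cross-entropy with the same-emotion pair (the diagonal) as target.
    targets = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, targets)


def mean_latent_per_emotion(latents: torch.Tensor, labels: torch.Tensor) -> dict:
    """Average the expressivity-encoder latents of all reference
    utterances of each emotion (hypothetical helper).

    latents -- (M, D) latents of M reference utterances.
    labels  -- (M,) integer emotion labels, one per utterance.
    """
    return {int(c): latents[labels == c].mean(dim=0) for c in labels.unique()}
```

At inference time, the mean vector of the desired emotion would replace the latent extracted from a reference spectrogram, which is what allows an emotion to be rendered in the voice of a speaker for whom no expressive recordings exist.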
