Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis

This paper presents Daft-Exprt, a multi-speaker acoustic model advancing the state-of-the-art on inter-speaker and inter-text prosody transfer. This improvement is achieved using FiLM conditioning layers, alongside adversarial training that encourages disentanglement between prosodic information and speaker identity. The acoustic model inherits attractive qualities from FastSpeech 2, such as fast inference and prediction of local prosody attributes for finer-grained control over generation. Experimental results show that Daft-Exprt significantly outperforms strong baselines on prosody transfer tasks, while yielding naturalness comparable to state-of-the-art expressive models. Moreover, results indicate that adversarial training effectively discards speaker identity information from the prosody representation, which ensures Daft-Exprt will consistently generate speech with the desired voice. We publicly release our code and provide speech samples from our experiments.
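To make the two mechanisms named above concrete, the following is a minimal PyTorch sketch, not the authors' released code, of FiLM conditioning (Perez et al.) and adversarial speaker disentanglement via gradient reversal (Ganin and Lempitsky). All module names, dimensions, and the `lambda_` scaling factor are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) the gradient in the
    backward pass, so the prosody encoder learns to fool the speaker classifier."""

    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # One gradient per forward input: reversed for x, none for lambda_.
        return -ctx.lambda_ * grad_output, None


class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift hidden features with
    (gamma, beta) predicted from a conditioning (prosody) embedding."""

    def __init__(self, cond_dim: int, hidden_dim: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, h: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, hidden_dim); cond: (batch, cond_dim)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * h + beta.unsqueeze(1)


class SpeakerAdversary(nn.Module):
    """Classifies the speaker from the prosody embedding, with the gradient
    reversed before it reaches the prosody encoder."""

    def __init__(self, cond_dim: int, n_speakers: int):
        super().__init__()
        self.classifier = nn.Linear(cond_dim, n_speakers)

    def forward(self, prosody_emb: torch.Tensor, lambda_: float = 1.0) -> torch.Tensor:
        return self.classifier(GradReverse.apply(prosody_emb, lambda_))
```

During training, a cross-entropy loss on the adversary's logits drives the classifier to recover the speaker, while the reversed gradient drives the prosody encoder to strip speaker cues from its embedding; this is the disentanglement effect the abstract attributes to adversarial training.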
