Controlling Emotion Strength with Relative Attribute for End-to-End Speech Synthesis

Recently, attention-based end-to-end speech synthesis has achieved superior performance over traditional speech synthesis models, and several approaches such as global style tokens have been proposed to explore the style controllability of end-to-end models. Although these methods perform well in style disentanglement and transfer, they still cannot explicitly control the emotion of the generated speech. In this paper, we focus on fine-grained control of expressive speech synthesis, where the emotion category and its strength are controlled by a discrete emotion vector and a simple continuous scalar, respectively. The continuous strength controller is learned by a ranking function based on relative attributes measured on an emotion dataset. Our method automatically learns the relationship between low-level acoustic features and high-level, subtle emotion strength. Experiments show that our method effectively improves the controllability of an expressive end-to-end model.
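
The following is a minimal, illustrative sketch of the relative-attribute idea behind the strength controller: a linear ranking function is fit so that emotional utterances score higher than neutral ones on utterance-level acoustic features, and the normalized score serves as the continuous strength scalar. The feature dimensionality, optimizer, and data here are placeholder assumptions, not the paper's actual configuration (which trains the ranker in the style of a ranking SVM).

```python
# Illustrative sketch only: learn a linear ranking function w so that
# emotional utterances outrank neutral ones on acoustic features; the
# rescaled score w @ x acts as the [0, 1] emotion-strength scalar.
# All feature shapes and hyperparameters below are hypothetical.
import numpy as np


def train_ranking_function(x_emotional, x_neutral, lr=0.01, epochs=200,
                           margin=1.0, l2=1e-3):
    """Fit w with a max-margin pairwise ranking loss:
    w @ x_emotional should exceed w @ x_neutral by at least `margin`."""
    dim = x_emotional.shape[1]
    w = np.zeros(dim)
    for _ in range(epochs):
        # All ordered pairs (emotional_i, neutral_j).
        diffs = x_emotional[:, None, :] - x_neutral[None, :, :]
        diffs = diffs.reshape(-1, dim)
        scores = diffs @ w
        violated = scores < margin          # pairs with active hinge loss
        grad = l2 * w - diffs[violated].sum(axis=0) / len(diffs)
        w -= lr * grad
    return w


def emotion_strength(w, features, s_min, s_max):
    """Map raw ranking scores to a [0, 1] strength scalar."""
    s = features @ w
    return np.clip((s - s_min) / (s_max - s_min + 1e-8), 0.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for utterance-level acoustic feature vectors.
    x_emo = rng.normal(loc=1.0, size=(50, 16))
    x_neu = rng.normal(loc=0.0, size=(50, 16))
    w = train_ranking_function(x_emo, x_neu)
    all_scores = np.concatenate([x_emo, x_neu]) @ w
    print(emotion_strength(w, x_emo[:3], all_scores.min(), all_scores.max()))
```

In use, the resulting scalar would be concatenated with the discrete emotion vector as conditioning input to the end-to-end synthesis model, so that sweeping the scalar from 0 to 1 varies the perceived strength of the chosen emotion.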
