StrengthNet: Deep Learning-based Emotion Strength Assessment for Emotional Speech Synthesis

Recently, emotional speech synthesis has achieved remarkable performance. Moreover, the emotion strength of synthesized speech can be controlled flexibly using a strength descriptor obtained from an emotion attribute ranking function. However, a ranking function trained on specific data generalizes poorly, which limits its applicability to more realistic cases. In this paper, we propose a deep learning-based emotion strength assessment network for strength prediction, referred to as StrengthNet. Our model follows a multi-task learning framework, with a structure that includes an acoustic encoder, a strength predictor, and an auxiliary emotion predictor. A data augmentation strategy is employed to improve model generalization. Experiments show that the emotion strength predicted by the proposed StrengthNet is highly correlated with ground-truth scores for both seen and unseen speech. Our code is available at: https://github.com/ttslr/StrengthNet.
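
The multi-task structure described above (a shared acoustic encoder feeding a strength predictor and an auxiliary emotion predictor) can be illustrated with a minimal sketch. The sketch below is written in PyTorch and assumes a MOSNet-style CNN-BiLSTM encoder over mel-spectrogram input; the class name, layer sizes, and five-class emotion output are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class StrengthNetSketch(nn.Module):
    """Illustrative multi-task model: a shared acoustic encoder feeding
    a frame-level strength predictor and an utterance-level emotion
    classifier. Layer sizes are assumptions, not the released model."""

    def __init__(self, n_mels: int = 80, n_emotions: int = 5):
        super().__init__()
        # Shared acoustic encoder: small CNN over the mel-spectrogram,
        # then a BiLSTM over the time axis (MOSNet-style assumption).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.blstm = nn.LSTM(input_size=32 * n_mels, hidden_size=128,
                             batch_first=True, bidirectional=True)
        # Strength head: frame-level scores averaged to an utterance score.
        self.strength_head = nn.Sequential(
            nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1)
        )
        # Auxiliary emotion head: utterance-level emotion class logits.
        self.emotion_head = nn.Linear(256, n_emotions)

    def forward(self, mel):                      # mel: (batch, time, n_mels)
        x = self.conv(mel.unsqueeze(1))          # (batch, 32, time, n_mels)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        h, _ = self.blstm(x)                     # (batch, time, 256)
        frame_strength = self.strength_head(h).squeeze(-1)  # (batch, time)
        utt_strength = frame_strength.mean(dim=1)            # (batch,)
        emotion_logits = self.emotion_head(h.mean(dim=1))    # (batch, n_emotions)
        return utt_strength, frame_strength, emotion_logits
```

Training such a model would typically combine a regression loss on the strength outputs with a cross-entropy loss on the emotion logits, weighted by a tunable coefficient; the exact loss weighting here would be an assumption rather than the paper's specification.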
