An End-to-end Multitask Learning Model to Improve Speech Emotion Recognition

In this paper, we propose an attention-based CNN-BLSTM model trained with an end-to-end (E2E) learning method. We first extract Mel-spectrograms from the raw audio files instead of using handcrafted features. We then adopt two types of attention mechanisms so that the model focuses on the salient emotional periods of the speech along the temporal dimension. Because individuals differ considerably in how they express emotions, we incorporate speaker recognition as an auxiliary task. Moreover, since the training dataset is small, we augment it with data from another language. We evaluate the proposed method on the SAVEE dataset under single-task, multitask, and cross-language training settings. The results show that the proposed model achieves 73.62% weighted accuracy and 71.11% unweighted accuracy on speech emotion recognition, outperforming the baseline by 11.13 points.
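
To make the described pipeline concrete, the following is a minimal sketch, not the paper's actual implementation: a CNN front-end over Mel-spectrograms, a BLSTM, a simple additive temporal attention pooling, and two task heads (emotion and speaker) trained jointly. All layer sizes, the attention form, and the multitask loss weight are illustrative assumptions; only the class counts follow SAVEE (7 emotions, 4 speakers).

```python
# Hypothetical sketch of an attention-based CNN-BLSTM with a speaker-recognition
# auxiliary head. Dimensions and loss weighting are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveCNNBLSTM(nn.Module):
    def __init__(self, n_mels=64, n_emotions=7, n_speakers=4, hidden=128):
        super().__init__()
        # CNN front-end over the (frequency, time) spectrogram
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
        )
        feat_dim = 64 * (n_mels // 4)
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Additive attention scoring each time step of the BLSTM output
        self.attn = nn.Linear(2 * hidden, 1)
        # Shared utterance representation feeds two task-specific heads
        self.emotion_head = nn.Linear(2 * hidden, n_emotions)
        self.speaker_head = nn.Linear(2 * hidden, n_speakers)

    def forward(self, mel):                       # mel: (batch, n_mels, frames)
        x = self.conv(mel.unsqueeze(1))           # (batch, 64, n_mels/4, frames/4)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)      # (batch, time, feat)
        h, _ = self.blstm(x)                      # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h).squeeze(-1), dim=1)   # attention weights
        ctx = (w.unsqueeze(-1) * h).sum(dim=1)                # weighted pooling
        return self.emotion_head(ctx), self.speaker_head(ctx)

if __name__ == "__main__":
    model = AttentiveCNNBLSTM()
    mel = torch.randn(2, 64, 200)                 # a batch of Mel-spectrograms
    emo_logits, spk_logits = model(mel)
    # Multitask objective: emotion loss plus a down-weighted speaker loss
    emo_y, spk_y = torch.tensor([0, 3]), torch.tensor([1, 2])
    loss = F.cross_entropy(emo_logits, emo_y) + 0.3 * F.cross_entropy(spk_logits, spk_y)
    loss.backward()
```

In this sketch the speaker head shares the attention-pooled representation with the emotion head, so gradients from the auxiliary task regularize the shared encoder; the 0.3 weight on the speaker loss is an assumed value, not one reported in the paper.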
