Speaker Attentive Speech Emotion Recognition

The Speech Emotion Recognition (SER) task has seen significant improvements in recent years with the advent of Deep Neural Networks (DNNs). However, even the most successful methods still struggle when adaptation to specific speakers and scenarios is required, and inevitably perform worse than humans. In this paper, we present novel work based on the idea of teaching the emotion recognition network about speaker identity. Our system combines two ACRNN classifiers, dedicated to speaker recognition and emotion recognition respectively. The former informs the latter through a Self Speaker Attention (SSA) mechanism, which is shown to considerably help the network focus on the emotional information in the speech signal. Experiments on the social attitudes database Att-HACK and the IEMOCAP corpus demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance in terms of unweighted average recall.
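
Below is a minimal PyTorch sketch of how such a speaker-attentive architecture could be wired up. It is an illustration under our own assumptions, not the authors' implementation: the layer sizes, the class names (ACRNNEncoder, SpeakerAttentiveSER), and the exact form of the Self Speaker Attention are hypothetical. The guiding idea it encodes is that the speaker branch produces an utterance-level embedding which serves as the query of an attention mechanism over the emotion branch's frame-level features.

# A minimal sketch, assuming illustrative layer sizes and one possible reading
# of the SSA mechanism; not the authors' configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ACRNNEncoder(nn.Module):
    """ACRNN-style encoder: CNN over the spectrogram, then a BiLSTM
    producing one feature vector per frame."""

    def __init__(self, n_mels=40, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                        # pool along frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.rnn = nn.LSTM(64 * (n_mels // 4), hidden,
                           batch_first=True, bidirectional=True)

    def forward(self, x):                       # x: (batch, 1, n_mels, frames)
        h = self.conv(x)                        # (batch, 64, n_mels//4, frames)
        h = h.flatten(1, 2).transpose(1, 2)     # (batch, frames, 64 * n_mels//4)
        out, _ = self.rnn(h)                    # (batch, frames, 2*hidden)
        return out


class SpeakerAttentiveSER(nn.Module):
    """Two ACRNN branches: the speaker branch yields an utterance-level
    speaker embedding that drives the attention weights over the emotion
    branch's frame features (one possible reading of the SSA mechanism)."""

    def __init__(self, n_mels=40, hidden=128, n_speakers=10, n_emotions=4):
        super().__init__()
        self.spk_enc = ACRNNEncoder(n_mels, hidden)
        self.emo_enc = ACRNNEncoder(n_mels, hidden)
        self.spk_head = nn.Linear(2 * hidden, n_speakers)
        # SSA: score each emotion frame against the speaker embedding.
        self.query = nn.Linear(2 * hidden, 2 * hidden)
        self.emo_head = nn.Linear(2 * hidden, n_emotions)

    def forward(self, x):
        spk_frames = self.spk_enc(x)                      # (B, T, 2H)
        spk_embed = spk_frames.mean(dim=1)                # utterance-level speaker embedding
        spk_logits = self.spk_head(spk_embed)

        emo_frames = self.emo_enc(x)                      # (B, T, 2H)
        # Attention scores: dot product between the speaker query and each frame.
        q = self.query(spk_embed).unsqueeze(2)            # (B, 2H, 1)
        scores = torch.bmm(emo_frames, q).squeeze(2)      # (B, T)
        weights = F.softmax(scores, dim=1).unsqueeze(2)   # (B, T, 1)
        emo_embed = (weights * emo_frames).sum(dim=1)     # (B, 2H)
        emo_logits = self.emo_head(emo_embed)
        return emo_logits, spk_logits


# Usage on a dummy batch of 40-band log-mel spectrograms, 300 frames long.
if __name__ == "__main__":
    model = SpeakerAttentiveSER()
    mels = torch.randn(8, 1, 40, 300)
    emo_logits, spk_logits = model(mels)
    print(emo_logits.shape, spk_logits.shape)   # torch.Size([8, 4]) torch.Size([8, 10])

In this reading, the two branches would be trained jointly with a cross-entropy loss on each head, so that the speaker embedding remains discriminative while guiding the emotion attention; the paper's actual training recipe may differ.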
