A Dialogical Emotion Decoder for Speech Emotion Recognition in Spoken Dialog

Developing a robust speech emotion recognition (SER) system for human dialog is important in advancing conversational agent design. In this paper, we propose a novel inference algorithm, the dialogical emotion decoding (DED) algorithm, which treats a dialog as a sequence and consecutively decodes the emotion state of each utterance over time with a given recognition engine. The decoder is trained by incorporating intra- and inter-speaker emotion influences within a conversation. Our approach achieves 70.1% accuracy on four-class emotion recognition on the IEMOCAP database, a 3% improvement over the state-of-the-art model. The evaluation is further conducted on a multi-party interaction database, MELD, which shows a similar effect. Our proposed DED is in essence a conversational emotion rescoring decoder that can be flexibly combined with different SER engines.
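The sequential-rescoring idea behind such a decoder can be illustrated with a minimal sketch: each utterance's SER posterior is reweighted by transition priors conditioned on the same speaker's previous emotion (intra-speaker) and on other speakers' previous emotions (inter-speaker). The greedy decoding strategy, the log-linear fusion weights `alpha` and `beta`, and the transition matrices `intra_T` and `inter_T` are all illustrative assumptions here, not the paper's exact DED formulation.

```python
import numpy as np

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # four-class setup, as in the evaluation

def decode_dialog(posteriors, speakers, intra_T, inter_T, alpha=0.5, beta=0.3):
    """Greedy conversational rescoring (illustrative, not the paper's DED).

    posteriors : (num_utterances, num_emotions) SER-engine posteriors
    speakers   : speaker label per utterance
    intra_T    : (E, E) transition prior given the same speaker's last emotion
    inter_T    : (E, E) transition prior given another speaker's last emotion
    """
    last_emotion = {}  # speaker -> index of that speaker's last decoded emotion
    decoded = []
    for post, spk in zip(posteriors, speakers):
        score = np.log(post + 1e-12)
        if spk in last_emotion:                  # intra-speaker influence
            score += alpha * np.log(intra_T[last_emotion[spk]] + 1e-12)
        for other, emo in last_emotion.items():  # inter-speaker influence
            if other != spk:
                score += beta * np.log(inter_T[emo] + 1e-12)
        choice = int(np.argmax(score))
        last_emotion[spk] = choice
        decoded.append(choice)
    return decoded
```

Because the decoder only rescores per-utterance posteriors, it can sit on top of any SER engine that outputs class probabilities, which is the flexibility the abstract refers to.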
