End-to-End Speech Emotion Recognition Combined with Acoustic-to-Word ASR Model

In this paper, we propose speech emotion recognition (SER) combined with an acoustic-to-word automatic speech recognition (ASR) model. While acoustic prosodic features are the primary input for SER, textual features are also useful; however, transcripts produced by ASR are error-prone, especially for emotional speech. To address this problem, we integrate the ASR and SER models in an end-to-end manner using an acoustic-to-word model. Specifically, we take the decoder states of the ASR model together with the acoustic features and feed them into the SER model. On top of a recurrent network that learns features from this input, we adopt a self-attention mechanism to focus on emotionally salient frames. Finally, we fine-tune the ASR model on the target emotion dataset with a multi-task learning method that jointly optimizes the ASR and SER tasks. Our model achieves 68.63% weighted accuracy (WA) and 69.67% unweighted accuracy (UA) on the IEMOCAP database, which is state-of-the-art performance.
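
To make the pipeline concrete, below is a minimal PyTorch sketch of the described architecture: ASR decoder states are concatenated with the acoustic features, passed through a recurrent SER network with self-attention pooling, and both tasks are optimized jointly. It is an illustration, not the authors' implementation: the module sizes, the frame-synchronous LSTM decoder (the paper uses an attention-based acoustic-to-word decoder), and the loss weight `alpha` are all assumptions.

```python
import torch
import torch.nn as nn

class SelfAttentionPooling(nn.Module):
    """Frame-level self-attention pooling, loosely in the spirit of the
    structured self-attentive embedding (Lin et al., 2017)."""
    def __init__(self, dim, attn_hidden=128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, attn_hidden), nn.Tanh(), nn.Linear(attn_hidden, 1))

    def forward(self, x):                        # x: (batch, frames, dim)
        w = torch.softmax(self.score(x), dim=1)  # one weight per frame
        return (w * x).sum(dim=1)                # (batch, dim)

class JointASRSER(nn.Module):
    """ASR branch produces word logits; its decoder states are concatenated
    with the acoustic features and fed to the SER branch."""
    def __init__(self, n_mels=40, vocab_size=10000, n_emotions=4,
                 enc_dim=256, dec_dim=256, ser_dim=128):
        super().__init__()
        # Acoustic-to-word ASR branch (an attention decoder in the paper;
        # simplified here to a frame-synchronous LSTM decoder).
        self.encoder = nn.LSTM(n_mels, enc_dim, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.decoder = nn.LSTM(2 * enc_dim, dec_dim, batch_first=True)
        self.word_out = nn.Linear(dec_dim, vocab_size)
        # SER branch over [acoustic features ; ASR decoder states].
        self.ser_rnn = nn.LSTM(n_mels + dec_dim, ser_dim,
                               batch_first=True, bidirectional=True)
        self.pool = SelfAttentionPooling(2 * ser_dim)
        self.emo_out = nn.Linear(2 * ser_dim, n_emotions)

    def forward(self, feats):                    # feats: (batch, frames, n_mels)
        enc, _ = self.encoder(feats)
        dec, _ = self.decoder(enc)               # ASR decoder states
        ser_in = torch.cat([feats, dec], dim=-1)
        h, _ = self.ser_rnn(ser_in)
        utt = self.pool(h)                       # attend to salient frames
        return self.word_out(dec), self.emo_out(utt)

# Multi-task fine-tuning: a weighted sum of the ASR and SER cross-entropy
# losses; the interpolation weight `alpha` is an assumed hyperparameter.
model = JointASRSER()
feats = torch.randn(2, 120, 40)                  # dummy batch of log-mel frames
word_targets = torch.randint(0, 10000, (2, 120))
emo_targets = torch.randint(0, 4, (2,))
word_logits, emo_logits = model(feats)
alpha = 0.5
loss = (alpha * nn.CrossEntropyLoss()(word_logits.transpose(1, 2), word_targets)
        + (1 - alpha) * nn.CrossEntropyLoss()(emo_logits, emo_targets))
loss.backward()
```

In this sketch the two branches share the acoustic front end, so fine-tuning with the joint loss lets gradients from the SER task reshape the ASR representations, which is the intent of the multi-task setup described above.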
