Towards Discriminative Representation Learning for Speech Emotion Recognition

In intelligent speech interaction, automatic speech emotion recognition (SER) plays an important role in understanding user intent. Because emotional speech varies widely in speaker characteristics while sharing similar acoustic attributes, one vital challenge in SER is learning robust and discriminative representations for emotion inference. In this paper, inspired by human emotion perception, we propose a novel representation learning component (RLC) for SER systems, constructed from multi-head self-attention and a Global Context-Aware Attention Long Short-Term Memory recurrent neural network (GCA-LSTM). Leveraging the ability of multi-head self-attention to model element-wise correlative dependencies, the RLC exploits the common patterns of emotional speech features to import more emotion-salient information during representation learning. Through the GCA-LSTM, the RLC selectively focuses on emotion-salient factors in light of the entire utterance context and gradually produces a discriminative representation for emotion inference. Experiments on the public emotional benchmark database IEMOCAP and a large-scale realistic interaction database show that the proposed SER framework outperforms state-of-the-art techniques, with 6.6% to 26.7% relative improvement in unweighted accuracy.
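The described pipeline (self-attention over frame-level features, then an attention-LSTM that weights frames against a global utterance context) can be sketched roughly as below. This is a minimal illustrative sketch, not the paper's implementation: the class name `RLCSketch`, the layer sizes, and the single-pass mean-pooled global context (the actual GCA-LSTM refines its context iteratively) are all assumptions.

```python
import torch
import torch.nn as nn

class RLCSketch(nn.Module):
    """Hypothetical sketch of the representation learning component (RLC):
    multi-head self-attention over frame-level acoustic features, followed by
    a simplified global-context attention over LSTM hidden states."""

    def __init__(self, feat_dim=40, hidden=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Gate scoring each frame's informativeness given the global context
        # (assumed form; the paper's GCA-LSTM uses a recurrent refinement).
        self.score = nn.Linear(hidden * 2, 1)

    def forward(self, x):                        # x: (batch, time, feat_dim)
        a, _ = self.attn(x, x, x)                # element-wise dependencies
        h, _ = self.lstm(a)                      # (batch, time, hidden)
        g = h.mean(dim=1, keepdim=True)          # crude global context
        w = torch.sigmoid(self.score(torch.cat([h, g.expand_as(h)], dim=-1)))
        # Weighted pooling yields one utterance-level representation
        return (w * h).sum(dim=1) / w.sum(dim=1)

x = torch.randn(2, 100, 40)                      # 2 utterances, 100 frames
rep = RLCSketch()(x)
print(rep.shape)                                 # torch.Size([2, 64])
```

The resulting fixed-size vector would then feed a softmax classifier over the emotion categories.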
