A pairwise discriminative task for speech emotion recognition

Speech emotion recognition is an important task in human-machine interaction. However, it faces many challenges, such as the ambiguity of emotional expression and the lack of training samples. To address these problems, we propose a novel 'pairwise discriminative task', which learns the similarity and distinction between two audio clips rather than their specific labels. In this task, pairs of audio clips are fed into audio encoder networks to extract audio vectors, which are then passed to discrimination networks that judge whether the two clips belong to the same emotion category. The system is optimized end-to-end to minimize a loss function that combines a cosine similarity loss and a cross-entropy loss. To verify the quality of the audio representation vectors extracted by the system, we test them on the IEMOCAP database, a common evaluation corpus. We obtain an unweighted accuracy of 56.33% on the test data, more than 5% higher than traditional end-to-end speech emotion recognition networks.
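The combined objective described above can be illustrated with a minimal sketch. The abstract does not state how the two loss terms are combined, so the weighted sum, the weight `alpha`, the `margin`, and all function names below are assumptions made for illustration only; the encoder and discrimination networks are replaced by plain vectors and a given probability `p_same`.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon avoids division by zero

def cosine_pair_loss(u, v, same, margin=0.0):
    """Pull same-emotion pairs together, push different pairs apart.

    Assumed form (not taken from the paper): 1 - sim for same-category
    pairs, max(0, sim - margin) for different-category pairs.
    """
    sim = cosine_similarity(u, v)
    return 1.0 - sim if same else max(0.0, sim - margin)

def binary_cross_entropy(p_same, same):
    """Cross-entropy of the discriminator's 'same category' probability."""
    p = min(max(p_same, 1e-12), 1.0 - 1e-12)  # clamp to keep log finite
    return -math.log(p) if same else -math.log(1.0 - p)

def pairwise_loss(u, v, p_same, same, alpha=0.5):
    """Hypothetical weighted sum of the two terms; alpha is an assumption."""
    return (alpha * cosine_pair_loss(u, v, same)
            + (1.0 - alpha) * binary_cross_entropy(p_same, same))

if __name__ == "__main__":
    # Two toy "audio vectors" standing in for encoder outputs.
    close_pair = pairwise_loss([1.0, 0.0], [1.0, 0.0], p_same=0.9, same=True)
    far_pair = pairwise_loss([1.0, 0.0], [0.0, 1.0], p_same=0.9, same=True)
    print(close_pair, far_pair)
```

In this sketch, a same-category pair whose vectors already point in the same direction incurs a smaller loss than a same-category pair with orthogonal vectors, which is the behavior the pairwise task is meant to encourage.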
