Investigation on Joint Representation Learning for Robust Feature Extraction in Speech Emotion Recognition

Speech emotion recognition (SER) is a challenging task because it is difficult to find representations that capture the emotion embedded in speech. Recently, the Convolutional Recurrent Neural Network (CRNN), which combines a convolutional neural network with a recurrent neural network, has become popular in this field and achieves state-of-the-art results on related corpora. However, most work on CRNN uses only simple spectral information, which cannot capture enough emotional characteristics for the SER task. In this work, we investigate two joint representation learning structures based on CRNN that aim to capture richer emotional information from speech. By combining handcrafted high-level statistical features with CRNN, we develop a two-channel SER system (HSF-CRNN) that jointly learns emotion-related features with better discriminative properties. Furthermore, because the duration of a speech segment significantly affects the accuracy of emotion recognition, we propose another two-channel SER system in which CRNN features extracted from spectrogram segments at different time scales are used for joint representation learning. The systems are evaluated on the Atypical Affect Challenge of ComParE2018 and on the IEMOCAP corpus. Experimental results show that our proposed systems outperform the plain CRNN.
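The two-channel idea with different time scales can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the segment lengths, the mean-pooling stand-in for a trained CRNN, and fusion by concatenation are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def segment_spectrogram(spec, seg_frames, hop_frames):
    """Split a (freq_bins, frames) spectrogram into fixed-length segments.

    Returns an array of shape (num_segments, freq_bins, seg_frames).
    Assumes the spectrogram has at least seg_frames frames.
    """
    n_frames = spec.shape[1]
    starts = range(0, n_frames - seg_frames + 1, hop_frames)
    return np.stack([spec[:, s:s + seg_frames] for s in starts])

def placeholder_encoder(segments):
    """Hypothetical stand-in for a trained CRNN channel.

    Averages over time within each segment, then over segments,
    giving one (freq_bins,) embedding per utterance.
    """
    return segments.mean(axis=2).mean(axis=0)

def fuse(channel_a, channel_b):
    """Joint representation by concatenation (an assumed fusion rule)."""
    return np.concatenate([channel_a, channel_b], axis=-1)

# Synthetic spectrogram: 40 frequency bins, 300 frames.
rng = np.random.default_rng(0)
spec = rng.standard_normal((40, 300))

# Channel 1: short segments; channel 2: long segments (lengths are assumptions).
short_segments = segment_spectrogram(spec, seg_frames=50, hop_frames=25)
long_segments = segment_spectrogram(spec, seg_frames=100, hop_frames=50)

joint = fuse(placeholder_encoder(short_segments),
             placeholder_encoder(long_segments))
print(short_segments.shape, long_segments.shape, joint.shape)
```

In a real system each channel would be a trained network and the fused vector would feed a classifier. The sketch only shows how two views of the same utterance are produced and combined.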
