End-to-end speech emotion recognition using multi-scale convolution networks

Automatic speech emotion recognition is one of the most challenging tasks in the machine learning community, mainly due to the significant variation across individuals expressing the same emotion cue. The success of emotion recognition with machine learning techniques depends primarily on the feature set chosen for learning. Formulating features that capture all variations in emotion cues, however, is not a trivial task. Recent work on emotion recognition with deep learning therefore focuses on end-to-end learning schemes, which identify features directly from the raw speech signal instead of relying on a hand-crafted feature set. Existing methods in this scheme, however, do not account for the fact that speech signals often exhibit distinct features at different time scales and frequencies that are not apparent in the raw form. We propose a multi-scale convolutional neural network (MCNN) to identify features at different time scales and frequencies from raw speech signals. This end-to-end model leverages a multi-branch input layer and tunable convolution layers to learn such features, which are subsequently used to recognize the corresponding emotion cues. As a proof of concept, the MCNN method with a fixed transformation stage is evaluated on the SAVEE emotion database. Results show that MCNN improves emotion recognition performance compared to existing methods, underpinning the necessity of learning features at different time scales.
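To make the multi-branch design concrete, below is a minimal PyTorch sketch of a multi-scale CNN over raw waveforms. The branch layout, the fixed transformations (down-sampling and moving-average smoothing), the kernel sizes, and the MultiScaleCNN class name are illustrative assumptions rather than the paper's actual configuration; the seven output classes simply match SAVEE's emotion categories.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleCNN(nn.Module):
    """Sketch of a multi-scale CNN over raw speech (assumed layout).

    Three input branches apply fixed transformations to the waveform:
    identity (original scale), down-sampling (coarser time scale), and
    moving-average smoothing (lower frequencies). Each branch feeds a
    tunable 1-D convolution; the pooled branch outputs are concatenated
    and classified. All hyperparameters here are illustrative.
    """

    def __init__(self, num_emotions: int = 7, channels: int = 32):
        super().__init__()
        # One learnable convolution per branch (the "tunable" stage).
        self.branch_orig = nn.Conv1d(1, channels, kernel_size=64, stride=8)
        self.branch_down = nn.Conv1d(1, channels, kernel_size=32, stride=4)
        self.branch_smooth = nn.Conv1d(1, channels, kernel_size=64, stride=8)
        self.classifier = nn.Linear(3 * channels, num_emotions)

    def forward(self, wave: torch.Tensor) -> torch.Tensor:
        # wave: (batch, 1, samples) raw speech at a fixed sampling rate.
        x_orig = wave
        # Fixed transformation: down-sample by 4 for a coarser time scale.
        x_down = F.avg_pool1d(wave, kernel_size=4, stride=4)
        # Fixed transformation: moving average acts as a low-pass filter.
        x_smooth = F.avg_pool1d(wave, kernel_size=9, stride=1, padding=4)

        feats = []
        for branch, x in [(self.branch_orig, x_orig),
                          (self.branch_down, x_down),
                          (self.branch_smooth, x_smooth)]:
            h = F.relu(branch(x))
            # Global max pooling over time makes features length-invariant.
            feats.append(h.max(dim=-1).values)
        return self.classifier(torch.cat(feats, dim=-1))


# Usage: a batch of two 1-second utterances at 16 kHz.
model = MultiScaleCNN(num_emotions=7)
logits = model(torch.randn(2, 1, 16000))
print(logits.shape)  # torch.Size([2, 7])
```

Global max pooling over time in each branch is one simple way to merge branches whose fixed transformations yield different sequence lengths; any length-invariant pooling would serve the same purpose in this sketch.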
