Dilated Residual Network with Multi-head Self-attention for Speech Emotion Recognition

Speech emotion recognition (SER) plays an important role in intelligent speech interaction. One vital challenge in SER is extracting emotion-relevant features from speech signals. In state-of-the-art SER techniques, deep learning methods, e.g., Convolutional Neural Networks (CNNs), are widely employed for feature learning and have achieved strong performance. However, CNN-oriented methods suffer from two limitations: 1) the temporal structure of speech is lost during progressive resolution reduction; 2) the relative dependencies between elements of the suprasegmental feature sequence are ignored. In this paper, we propose the combined use of a Dilated Residual Network (DRN) and Multi-head Self-attention to alleviate these limitations. By employing the DRN, the network retains a high-resolution temporal structure during feature learning, with a receptive field similar in size to that of CNN-based approaches. By employing Multi-head Self-attention, the network models the dependencies between elements at different positions in the learned suprasegmental feature sequence, which enhances the capture of emotion-salient information. Experiments on the benchmark emotion dataset IEMOCAP demonstrate the effectiveness of the proposed framework, with 11.7% to 18.6% relative improvement over state-of-the-art approaches.
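
The following is a minimal PyTorch sketch of the two components described above: a dilated residual block that enlarges the receptive field while preserving temporal resolution, followed by multi-head self-attention over the resulting feature sequence. The input format (log-mel spectrogram frames), channel widths, dilation rates, head count, and class count are illustrative assumptions, not the configuration used in the paper.

# Sketch of a DRN + multi-head self-attention SER model.
# All sizes below are assumed for illustration, not taken from the paper.
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """1-D convolution with dilation instead of striding/pooling:
    the receptive field grows with the dilation rate while the
    sequence length (temporal resolution) is preserved."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        # padding = dilation gives "same" output length for kernel size 3
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):                              # x: (batch, channels, time)
        return self.relu(x + self.bn(self.conv(x)))    # residual shortcut

class DRNSelfAttentionSER(nn.Module):
    def __init__(self, n_mels=40, channels=64, heads=4, n_classes=4):
        super().__init__()
        self.proj = nn.Conv1d(n_mels, channels, kernel_size=1)
        # Exponentially increasing dilations widen the receptive field
        # without any resolution reduction.
        self.drn = nn.Sequential(*[DilatedResidualBlock(channels, d)
                                   for d in (1, 2, 4, 8)])
        # Self-attention relates every frame-level feature to every
        # other, regardless of their distance in the sequence.
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.fc = nn.Linear(channels, n_classes)

    def forward(self, mel):                            # mel: (batch, n_mels, time)
        h = self.drn(self.proj(mel)).transpose(1, 2)   # (batch, time, channels)
        h, _ = self.attn(h, h, h)                      # multi-head self-attention
        return self.fc(h.mean(dim=1))                  # average-pool over time

# Usage: 2 utterances, 40 mel bands, 300 frames -> (2, n_classes) logits
logits = DRNSelfAttentionSER()(torch.randn(2, 40, 300))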
