Speech Emotion Recognition with Local-Global Aware Deep Representation Learning

Convolutional neural network (CNN) based deep representation learning methods for speech emotion recognition (SER) have demonstrated great success. However, the basic design of CNNs limits them to modeling local information well. A capsule network (CapsNet) can overcome this shortcoming of CNNs by capturing shallow global features from the spectrogram, but it cannot learn local or deep global information. In this paper, we propose a local-global aware deep representation learning system that consists mainly of two modules. One module contains a multi-scale CNN, the time-frequency CNN (TFCNN), to learn the local representation. In the other module, we introduce a structure with dense connections among multiple blocks to learn shallow and deep global information. Every block in this structure is a complete CapsNet improved by a new routing algorithm. The local and global representations are fed to the classifier, and the system achieves an absolute improvement of at least 4.25% over the benchmarks on IEMOCAP.
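For readers unfamiliar with capsule routing, the baseline routing-by-agreement procedure of the original CapsNet ("Dynamic Routing Between Capsules") can be sketched as below. This is a minimal illustration under stated assumptions, not the improved routing algorithm proposed in this paper, which the abstract does not specify. The function names (`squash`, `dynamic_routing`), the nested-list layout of `u_hat`, and the default of three iterations are choices made for this sketch only.

```python
import math


def squash(s):
    """Squash nonlinearity: keeps the direction of s, maps its norm into [0, 1)."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0 for _ in s]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(b - m) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, num_iters=3):
    """Baseline routing-by-agreement (assumed layout, for illustration).

    u_hat[i][j] is the prediction vector that lower-level capsule i
    makes for upper-level capsule j. Returns one output vector per
    upper-level capsule.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits
    v = []
    for _ in range(num_iters):
        # Coupling coefficients: softmax over upper capsules for each lower capsule.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_out):
            # Weighted sum of predictions, then squash.
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Raise the logit where a prediction agrees with the output (dot product).
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v
```

Because the coupling coefficients are recomputed from agreement between lower-level predictions and upper-level outputs, each upper capsule aggregates evidence from every lower capsule, which is one way to read the claim that CapsNet captures global structure that a convolution with a fixed local receptive field does not.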
