Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Automatic emotion recognition from speech is an important and challenging task in affective computing, and its accuracy relies heavily on how effective the speech features are for classification. Previous approaches have mostly focused on extracting carefully hand-crafted features, and how to model spatio-temporal dynamics effectively for speech emotion recognition is still under active investigation. In this paper, we propose a method for extracting emotion-relevant features from speech that integrates attention-based bidirectional long short-term memory recurrent neural networks with fully convolutional networks (FCNs) in order to automatically learn the best spatio-temporal representations of speech signals. The learned high-level features are then fed into a deep neural network (DNN) to predict the final emotion. Experimental results on the Chinese Natural Audio-Visual Emotion Database (CHEAVD) and the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpora show that our method provides more accurate predictions than other existing emotion recognition algorithms.
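
To make the feature-fusion idea concrete, the following is a minimal sketch of how an attention-pooled recurrent summary and a pooled convolutional summary could be combined into one utterance-level vector before the DNN classifier. It is an illustration under assumptions, not the paper's implementation: the abstract does not specify the attention scoring function, the pooling used on the FCN branch, or the fusion rule, so dot-product scoring, global average pooling, and concatenation are assumed here, and all function names (`attention_pool`, `global_average_pool`, `fuse`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Collapse per-frame hidden states into one context vector.

    hidden_states: list of T vectors (e.g. bidirectional LSTM outputs).
    query: a vector of the same size, standing in for a learned parameter.
    Scoring by dot product is an assumption made for this sketch.
    """
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return context, weights


def global_average_pool(feature_maps):
    """Average each channel over time (assumed pooling for the FCN branch)."""
    return [sum(channel) / len(channel) for channel in feature_maps]


def fuse(blstm_states, query, fcn_maps):
    """Concatenate both summaries into one vector for a downstream classifier."""
    context, _ = attention_pool(blstm_states, query)
    return context + global_average_pool(fcn_maps)
```

With a zero query vector every frame receives the same attention weight, so the context vector reduces to the plain mean of the hidden states; a trained query would instead emphasize the frames most relevant to the emotion label.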
