Multi-Attention Fusion Network for Video-based Emotion Recognition

Humans routinely attend to salient emotional cues from the visual and audio modalities without explicitly resolving multimodal alignment issues, and recognize emotions by integrating the important multimodal information over a certain time interval. In this paper, we propose a multi-attention fusion network (MAFN) that aims to improve emotion recognition performance by modeling this human emotion recognition mechanism. MAFN consists of two types of attention mechanisms: an intra-modality attention mechanism dynamically extracts representative emotion features from single-modality frame sequences, and an inter-modality attention mechanism automatically highlights specific modality features according to their importance. In addition, we introduce a multimodal domain adaptation method that helps capture interactions between the modalities. MAFN achieves 58.65% recognition accuracy on the AFEW test set, a significant improvement over the 41.07% baseline.
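The two attention mechanisms can be illustrated with a minimal PyTorch sketch. This is an assumed implementation for illustration only: the module names, feature dimensions, and the choice of a single linear scoring layer with softmax normalization are our assumptions, not the authors' exact formulation. Intra-modality attention pools the frame sequence of one modality into a clip-level vector; inter-modality attention then reweights the per-modality vectors by their estimated importance before fusing them.

import torch
import torch.nn as nn

class IntraModalityAttention(nn.Module):
    """Pools a frame-level feature sequence of one modality into a single
    clip-level vector using learned attention weights over time."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # scalar relevance score per frame

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim)
        weights = torch.softmax(self.score(frames), dim=1)  # (batch, time, 1)
        return (weights * frames).sum(dim=1)                # (batch, feat_dim)

class InterModalityAttention(nn.Module):
    """Reweights per-modality clip-level vectors by their estimated
    importance, then fuses them into a single representation."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # scalar importance score per modality

    def forward(self, modal_feats: torch.Tensor) -> torch.Tensor:
        # modal_feats: (batch, num_modalities, feat_dim)
        weights = torch.softmax(self.score(modal_feats), dim=1)  # (batch, M, 1)
        return (weights * modal_feats).sum(dim=1)                # (batch, feat_dim)

# Hypothetical usage: visual and audio frame features are pooled per modality,
# fused across modalities, and classified into AFEW's seven emotion categories.
batch, time, dim = 4, 16, 512
visual = torch.randn(batch, time, dim)   # e.g. per-frame CNN face features
audio = torch.randn(batch, time, dim)    # e.g. per-segment acoustic features
intra = IntraModalityAttention(dim)
inter = InterModalityAttention(dim)
fused = inter(torch.stack([intra(visual), intra(audio)], dim=1))  # (batch, dim)
logits = nn.Linear(dim, 7)(fused)        # 7 emotion classes in AFEW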
