Information Fusion in Attention Networks Using Adaptive and Multi-Level Factorized Bilinear Pooling for Audio-Visual Emotion Recognition

Multimodal emotion recognition is a challenging task in affective computing: human emotions are abstract concepts expressed in many different ways, which makes it difficult to extract discriminative features that capture their subtle differences. Moreover, how to fully utilize both audio and visual information remains an open problem. In this paper, we propose a novel multimodal fusion attention network for audio-visual emotion recognition based on adaptive and multi-level factorized bilinear pooling (FBP). First, for the audio stream, a fully convolutional network (FCN) equipped with a 1-D attention mechanism and local response normalization is designed for speech emotion recognition. Next, a global FBP (G-FBP) approach is presented to fuse audio and visual information by integrating a self-attention based video stream with the proposed audio stream. To improve on G-FBP, an adaptive strategy (AG-FBP) that dynamically computes the fusion weight of the two modalities is devised, based on the emotion-related representation vectors produced by the attention mechanism of each modality. Finally, to fully utilize local emotion information, adaptive and multi-level FBP (AM-FBP) is introduced on top of AG-FBP, combining both global-trunk and intra-trunk data within one recording. Tested on the IEMOCAP corpus for speech emotion recognition with the audio stream alone, the new FCN method outperforms the state of the art with an accuracy of 71.40%. Furthermore, validated on the AFEW database of the EmotiW2019 sub-challenge and the IEMOCAP corpus for audio-visual emotion recognition, the proposed AM-FBP approach achieves the best test-set accuracies of 63.09% and 75.49%, respectively.
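To make the fusion step concrete, the sketch below illustrates the factorized bilinear pooling operation underlying G-FBP, following the standard MFB formulation (project each modality, take the element-wise product, sum-pool, then apply power and L2 normalization). This is a minimal illustration only: the class name, layer dimensions, and pooling window are assumptions for the example, not the authors' implementation or settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedBilinearPooling(nn.Module):
    """Minimal MFB-style fusion of an audio and a video embedding.

    z = SumPool(U^T a * V^T v), followed by signed square-root (power)
    normalization and L2 normalization. All dimensions here are
    illustrative assumptions, not the paper's configuration.
    """
    def __init__(self, audio_dim=512, video_dim=512, factor_dim=1024, k=4):
        super().__init__()
        self.k = k                                    # sum-pooling window size
        self.proj_a = nn.Linear(audio_dim, factor_dim * k)   # U^T a
        self.proj_v = nn.Linear(video_dim, factor_dim * k)   # V^T v

    def forward(self, a, v):
        joint = self.proj_a(a) * self.proj_v(v)       # element-wise product
        joint = joint.view(joint.size(0), -1, self.k).sum(dim=2)  # sum pooling
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-12)  # power norm
        return F.normalize(joint, dim=1)              # L2 normalization

# Usage: fuse utterance-level audio and video vectors for a batch of 8.
fbp = FactorizedBilinearPooling()
audio_vec = torch.randn(8, 512)    # e.g., FCN + 1-D attention output
video_vec = torch.randn(8, 512)    # e.g., self-attention video output
fused = fbp(audio_vec, video_vec)  # -> (8, 1024) fused representation
```

The AG-FBP variant described in the abstract would additionally scale the two inputs by adaptive weights derived from each modality's attention-based representation vector before the element-wise product; that step is omitted from this sketch.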
