Self-attention fusion for audiovisual emotion recognition with incomplete data

In this paper, we consider the problem of multimodal data analysis, with audiovisual emotion recognition as the use case. We propose an architecture capable of learning from raw data and describe three variants of it with distinct modality fusion mechanisms. While most previous works assume the ideal scenario in which both modalities are present at all times during inference, we evaluate the robustness of the model in unconstrained settings where one modality is absent or noisy, and propose modality dropout as a method to mitigate these limitations. Most importantly, we find that this approach not only improves performance drastically when one modality is absent or its representation is noisy, but also improves performance in the standard ideal setting, outperforming competing methods.
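
The modality-dropout idea described above can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes a PyTorch-style two-stream model whose audio and visual feature tensors are batched along the first dimension, and names such as `modality_dropout` and `p_drop` are illustrative only.

```python
import torch


def modality_dropout(audio_feat: torch.Tensor,
                     visual_feat: torch.Tensor,
                     p_drop: float = 0.2,
                     training: bool = True):
    """Randomly zero out one modality per sample during training so that the
    downstream fusion layers learn to cope with a missing or noisy stream."""
    if not training or p_drop <= 0.0:
        return audio_feat, visual_feat

    batch_size = audio_feat.shape[0]
    device = audio_feat.device

    # For each sample, decide whether to drop a modality and, if so, which one.
    drop = torch.rand(batch_size, device=device) < p_drop
    drop_audio = drop & (torch.rand(batch_size, device=device) < 0.5)
    drop_visual = drop & ~drop_audio

    # Broadcast the per-sample keep-masks over the remaining feature dimensions.
    a_mask = (~drop_audio).float().view(-1, *([1] * (audio_feat.dim() - 1)))
    v_mask = (~drop_visual).float().view(-1, *([1] * (visual_feat.dim() - 1)))
    return audio_feat * a_mask, visual_feat * v_mask
```

In such a setup, the function would be applied to the per-modality embeddings just before the fusion block during training; at inference time it is a no-op, so when a modality is genuinely absent its features can simply be replaced by zeros and the fusion layers see inputs of the kind they were trained on.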