MSAF: Multimodal Split Attention Fusion

Multimodal learning mimics the reasoning process of the human multi-sensory system, which perceives the surrounding world by relating crucial cues from multiple sources of information when making a prediction. In this work, we propose a novel multimodal fusion module that learns to emphasize the more contributive features across all modalities. Specifically, the proposed Multimodal Split Attention Fusion (MSAF) module splits each modality into channel-wise equal feature blocks and creates a joint representation that is used to generate soft attention for each channel across the feature blocks. Further, the MSAF module is designed to be compatible with features of various spatial dimensions and sequence lengths, making it suitable for both CNNs and RNNs. Thus, MSAF can be easily added to fuse the features of any unimodal networks and can leverage existing pretrained unimodal model weights. To demonstrate the effectiveness of our fusion module, we design three multimodal networks with MSAF for emotion recognition, sentiment analysis, and action recognition tasks. Our approach achieves competitive results in each task and outperforms other application-specific networks and multimodal fusion benchmarks.
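
To make the described mechanism concrete, below is a minimal PyTorch sketch of a split-attention fusion layer, assuming unimodal features that have already been global-average-pooled to (batch, channels) vectors. The module name `SplitAttentionFusion`, the block count, the reduction ratio, and the use of sigmoid gating are illustrative assumptions rather than the authors' exact design.

```python
# A minimal sketch (not the authors' implementation) of channel-wise
# split-attention fusion as described in the abstract. Assumptions:
# pooled (batch, channels) inputs per modality, sigmoid gates, and a
# shared bottleneck that forms the joint representation.
import torch
import torch.nn as nn


class SplitAttentionFusion(nn.Module):
    def __init__(self, channels_per_modality, num_blocks=4, reduction=4):
        super().__init__()
        self.num_blocks = num_blocks
        # Each modality's channels must divide evenly into equal blocks.
        self.block_sizes = [c // num_blocks for c in channels_per_modality]
        joint_dim = min(self.block_sizes)
        # Per-modality projections feeding the shared joint representation.
        self.squeeze = nn.ModuleList(
            [nn.Linear(b, joint_dim) for b in self.block_sizes]
        )
        self.bottleneck = nn.Sequential(
            nn.Linear(joint_dim, joint_dim // reduction), nn.ReLU(inplace=True)
        )
        # One excitation head per modality, producing per-channel gates
        # for each of its feature blocks.
        self.excite = nn.ModuleList(
            [nn.Linear(joint_dim // reduction, b * num_blocks)
             for b in self.block_sizes]
        )

    def forward(self, features):
        # features: list of (batch, channels_m) pooled unimodal features.
        pooled_blocks, joint = [], 0.0
        for m, x in enumerate(features):
            # Split channels into equal blocks: (batch, blocks, block_size_m).
            blocks = torch.stack(x.chunk(self.num_blocks, dim=1), dim=1)
            pooled_blocks.append(blocks)
            # Sum projected block descriptors into one joint vector.
            joint = joint + self.squeeze[m](blocks).sum(dim=1)
        z = self.bottleneck(joint)  # shared joint representation
        fused = []
        for m, blocks in enumerate(pooled_blocks):
            # Soft attention per channel for every block of this modality.
            gates = torch.sigmoid(self.excite[m](z).view(blocks.shape))
            fused.append((blocks * gates).flatten(1))
        return fused  # re-weighted (batch, channels_m) features per modality


# Example: fuse 128-channel audio and 256-channel visual descriptors.
if __name__ == "__main__":
    msaf = SplitAttentionFusion([128, 256], num_blocks=4)
    out = msaf([torch.randn(8, 128), torch.randn(8, 256)])
    print([o.shape for o in out])  # [torch.Size([8, 128]), torch.Size([8, 256])]
```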
