Residual Attention-Based Fusion for Video Classification

Video data is inherently multimodal and sequential. Deep learning models therefore need to aggregate information from all modalities while capturing the most relevant spatio-temporal cues in a given video. This paper presents a multimodal deep learning framework for video classification using a Residual Attention-based Fusion (RAF) method. Specifically, the framework extracts spatio-temporal features from each modality with residual attention-based bidirectional Long Short-Term Memory (BiLSTM) networks and fuses the resulting representations with a weighted Support Vector Machine (SVM) to handle class imbalance. Experimental results on a natural disaster video dataset show that our approach improves upon the state-of-the-art by 5% and 8% in F1-score and MAP, respectively. Most notably, the proposed residual attention model reaches an F1-score of 0.95 and a MAP of 0.92 on this dataset.
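To make the pipeline concrete, the sketch below shows one plausible reading of the described architecture: a per-modality BiLSTM encoder with additive temporal attention and a residual connection, whose outputs are concatenated and classified by a class-weighted SVM. The abstract does not give implementation details, so the toolkit (PyTorch and scikit-learn), layer sizes, the exact form of the residual attention block, and the encoder/feature names are assumptions made for illustration only.

```python
# Minimal sketch of residual attention-based fusion (assumed, not the authors' exact code).
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.svm import SVC


class ResidualAttentionBiLSTM(nn.Module):
    """Per-modality encoder: BiLSTM + additive temporal attention with a residual connection."""

    def __init__(self, input_dim, hidden_dim=128):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x):                       # x: (batch, time, input_dim)
        h, _ = self.bilstm(x)                   # (batch, time, 2 * hidden_dim)
        scores = self.attn(h).squeeze(-1)       # (batch, time)
        weights = F.softmax(scores, dim=1)      # temporal attention weights
        context = torch.bmm(weights.unsqueeze(1), h).squeeze(1)
        # Residual connection: add the mean-pooled BiLSTM states to the attended context.
        return context + h.mean(dim=1)          # (batch, 2 * hidden_dim)


# Hypothetical per-modality input sizes (e.g., CNN frame features and audio features).
visual_encoder = ResidualAttentionBiLSTM(input_dim=2048)
audio_encoder = ResidualAttentionBiLSTM(input_dim=1024)


def fuse(visual_seq, audio_seq):
    """Concatenate per-modality embeddings into a single fused feature vector."""
    with torch.no_grad():
        v = visual_encoder(visual_seq)
        a = audio_encoder(audio_seq)
    return torch.cat([v, a], dim=1).numpy()


# Weighted SVM: class_weight="balanced" reweights minority disaster classes
# so the imbalanced data does not bias the decision boundary.
svm = SVC(kernel="rbf", class_weight="balanced")
# svm.fit(fuse(train_visual, train_audio), train_labels)
```

In this reading, the residual connection lets gradients bypass the attention block while the attention weights emphasize the most informative time steps; the fused vectors are then treated as fixed features for the weighted SVM.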
