Video Visual Relation Detection via Multi-modal Feature Fusion

Video visual relation detection is a meaningful research problem that aims to build a bridge between dynamic vision and language. In this paper, we propose a novel video visual relation detection method based on multi-modal feature fusion. First, we densely detect objects in each frame with a state-of-the-art video object detection model, flow-guided feature aggregation (FGFA), and generate object trajectories by linking the temporally independent detections with Seq-NMS and the KCF tracker. Next, we break the relation candidates, i.e., pairs of co-occurring object trajectories, into short-term segments and predict relations for them using spatio-temporal features and language context features. Finally, we greedily associate the short-term relation segments into complete relation instances. Experimental results show that our proposed method outperforms other methods by a large margin, and it also earned us first place in the visual relation detection task of the Video Relation Understanding (VRU) Challenge at ACM Multimedia 2019.
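The greedy association step can be made concrete with a short sketch. The `RelationSegment` data layout, the `max_gap` parameter, and the mean-score fusion below are illustrative assumptions, not the authors' published implementation: short-term segments that share the same ⟨subject, predicate, object⟩ triplet and are temporally adjacent are chained into longer relation instances.

```python
# A minimal sketch of greedy relation-segment association, assuming a
# simplified data layout; the fields, max_gap parameter, and mean-score
# fusion are illustrative choices, not the authors' exact implementation.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class RelationSegment:
    triplet: Tuple[str, str, str]  # (subject, predicate, object) labels
    begin: int                     # first frame of the segment (inclusive)
    end: int                       # last frame of the segment (exclusive)
    score: float                   # relation confidence for this segment

@dataclass
class RelationInstance:
    triplet: Tuple[str, str, str]
    begin: int
    end: int
    scores: List[float] = field(default_factory=list)

    @property
    def score(self) -> float:
        # Fuse segment confidences by averaging.
        return sum(self.scores) / len(self.scores)

def associate(segments: List[RelationSegment],
              max_gap: int = 0) -> List[RelationInstance]:
    """Greedily chain short-term segments into complete relation instances.

    Segments are scanned in temporal order; a segment extends an open
    instance if the triplet matches and it starts no more than `max_gap`
    frames after the instance ends, otherwise it opens a new instance.
    """
    instances: List[RelationInstance] = []
    for seg in sorted(segments, key=lambda s: s.begin):
        for inst in instances:
            if (inst.triplet == seg.triplet
                    and seg.begin <= inst.end + max_gap):
                inst.end = max(inst.end, seg.end)
                inst.scores.append(seg.score)
                break
        else:
            instances.append(RelationInstance(
                seg.triplet, seg.begin, seg.end, [seg.score]))
    return instances

if __name__ == "__main__":
    # Two adjacent "dog chase ball" segments merge into one instance;
    # the "dog bite ball" segment opens a separate instance.
    segs = [
        RelationSegment(("dog", "chase", "ball"), 0, 30, 0.9),
        RelationSegment(("dog", "chase", "ball"), 30, 60, 0.8),
        RelationSegment(("dog", "bite", "ball"), 60, 90, 0.7),
    ]
    for inst in associate(segs):
        print(inst.triplet, inst.begin, inst.end, round(inst.score, 2))
```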
