Video Visual Relation Detection via Multi-modal Feature Fusion

Video visual relation detection is a meaningful research problem that aims to build a bridge between dynamic vision and language. In this paper, we propose a novel video visual relation detection method based on multi-modal feature fusion. First, we densely detect objects in each frame with a state-of-the-art video object detection model, flow-guided feature aggregation (FGFA), and generate object trajectories by linking the temporally independent detections with Seq-NMS and the KCF tracker. Next, we break the relation candidates, i.e., pairs of co-occurring object trajectories, into short-term segments and predict relations for them using spatio-temporal features and language context features. Finally, we greedily associate the short-term relation segments into complete relation instances. Experimental results show that our proposed method outperforms other methods by a large margin, and it also earned us first place in the visual relation detection task of the Video Relation Understanding (VRU) Challenge at ACM Multimedia 2019.
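The greedy association step can be made concrete with a short sketch. The `RelationSegment` data layout, the `max_gap` parameter, and the mean-score fusion below are illustrative assumptions, not the authors' published implementation: short-term segments that share the same ⟨subject, predicate, object⟩ triplet and are temporally adjacent are chained into longer relation instances.

```python
# A minimal sketch of greedy relation-segment association, assuming a
# simplified data layout; the fields, max_gap parameter, and mean-score
# fusion are illustrative choices, not the authors' exact implementation.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class RelationSegment:
    triplet: Tuple[str, str, str]  # (subject, predicate, object) labels
    begin: int                     # first frame of the segment (inclusive)
    end: int                       # last frame of the segment (exclusive)
    score: float                   # relation confidence for this segment

@dataclass
class RelationInstance:
    triplet: Tuple[str, str, str]
    begin: int
    end: int
    scores: List[float] = field(default_factory=list)

    @property
    def score(self) -> float:
        # Fuse segment confidences by averaging.
        return sum(self.scores) / len(self.scores)

def associate(segments: List[RelationSegment],
              max_gap: int = 0) -> List[RelationInstance]:
    """Greedily chain short-term segments into complete relation instances.

    Segments are scanned in temporal order; a segment extends an open
    instance if the triplet matches and it starts no more than `max_gap`
    frames after the instance ends, otherwise it opens a new instance.
    """
    instances: List[RelationInstance] = []
    for seg in sorted(segments, key=lambda s: s.begin):
        for inst in instances:
            if (inst.triplet == seg.triplet
                    and seg.begin <= inst.end + max_gap):
                inst.end = max(inst.end, seg.end)
                inst.scores.append(seg.score)
                break
        else:
            instances.append(RelationInstance(
                seg.triplet, seg.begin, seg.end, [seg.score]))
    return instances

if __name__ == "__main__":
    # Two adjacent "dog chase ball" segments merge into one instance;
    # the "dog bite ball" segment opens a separate instance.
    segs = [
        RelationSegment(("dog", "chase", "ball"), 0, 30, 0.9),
        RelationSegment(("dog", "chase", "ball"), 30, 60, 0.8),
        RelationSegment(("dog", "bite", "ball"), 60, 90, 0.7),
    ]
    for inst in associate(segs):
        print(inst.triplet, inst.begin, inst.end, round(inst.score, 2))
```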
