Beyond Short-Term Snippet: Video Relation Detection With Spatio-Temporal Global Context

Video visual relation detection (VidVRD) aims to describe all interacting objects in a video. Unlike relationships in static images, relationships in videos involve an additional temporal dimension. Most existing works divide a video into short segments, predict relationships in each segment, and then merge the results. Such methods cannot capture relations that involve long motions, and predicting the same relationship repeatedly across neighboring segments is inefficient. To address these issues, this work proposes a novel sliding-window scheme that predicts short-term and long-term relationships simultaneously. We run windows with different kernel sizes over object tracklets to generate sub-tracklet proposals of different durations, while keeping the computational load similar to that of segment-based methods. To fully utilize the spatial and temporal information in videos, we construct one spatial graph and one temporal graph and employ Graph Convolutional Networks to generate contextual embeddings, which are used to evaluate the compatibility of tracklet proposals. We predict relationships only on highly compatible proposal pairs. Our method achieves state-of-the-art performance on both the ImageNet-VidVRD and VidOR datasets across multiple tasks. On ImageNet-VidVRD in particular, we obtain an average improvement of 3% across all evaluation metrics (R@50 rises from 8.07% to 11.21%).
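
The multi-scale sliding-window idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sub_tracklet_proposals`, the kernel sizes, and the half-kernel stride are all assumptions chosen for illustration, since the abstract does not state them. The sketch only shows how windows of several lengths, run over one tracklet's frame range, yield both short and long sub-tracklet proposals.

```python
from typing import List, Tuple


def sub_tracklet_proposals(
    start: int,
    end: int,
    kernel_sizes: List[int],
    stride_ratio: float = 0.5,  # hypothetical: stride is half the kernel size
) -> List[Tuple[int, int]]:
    """Slide windows of several lengths over a tracklet spanning frames [start, end).

    Returns (first_frame, end_frame_exclusive) pairs, one per sub-tracklet proposal.
    """
    proposals = []
    length = end - start
    for k in kernel_sizes:
        if k > length:
            # A window longer than the tracklet cannot produce a proposal.
            continue
        stride = max(1, int(k * stride_ratio))
        for s in range(start, end - k + 1, stride):
            proposals.append((s, s + k))
    return proposals


# A 120-frame tracklet with three illustrative kernel sizes (assumed values).
windows = sub_tracklet_proposals(0, 120, kernel_sizes=[30, 60, 120])
for first, last in windows:
    print(first, last)
```

With these assumed settings, the short windows cover brief interactions while the 120-frame window spans the whole tracklet, so a long motion is represented by a single proposal rather than being stitched together from segment-level predictions.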
