3-D Relation Network for visual relation recognition in videos

Abstract: Video visual relation recognition aims to mine dynamic relation instances between objects in the form of 〈subject, predicate, object〉, such as “person1-towards-person2” and “person-ride-bicycle”. Existing solutions treat the problem as several independent sub-tasks, i.e., image object detection, video object tracking, and trajectory-based relation prediction. We argue that this separation blocks the flow of information between the sub-models: the sub-tasks cannot share a common set of task-specific features, so each one learns a redundant representation. To this end, we connect the three sub-tasks in an end-to-end manner by proposing the 3-D relation proposal, which serves as a bridge for relation feature learning. Specifically, we put forward a novel deep neural network, named 3DRN, that fuses spatio-temporal visual characteristics, object label features, and spatial interactive features to learn relation instances from multi-modal cues. In addition, we provide a three-stage training strategy to facilitate large-scale parameter optimization. We conduct extensive experiments on two public datasets with different emphases to demonstrate the effectiveness of the proposed end-to-end feature learning method for visual relation recognition in videos. Furthermore, we verify the potential of our approach by applying it to the video relation detection task.
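
To make the 〈subject, predicate, object〉 output format and the idea of combining three feature types concrete, here is a minimal illustrative sketch. It is not the 3DRN architecture: the abstract does not specify how the features are fused, so plain concatenation followed by a linear scorer is assumed here, and every name (`RelationInstance`, `fuse_features`, `score_predicates`, `predict_relation`) as well as all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class RelationInstance:
    """One relation instance in <subject, predicate, object> form."""
    subject: str
    predicate: str
    obj: str

    def __str__(self) -> str:
        # Same hyphenated notation as the examples in the abstract.
        return f"{self.subject}-{self.predicate}-{self.obj}"


def fuse_features(visual: Sequence[float],
                  label: Sequence[float],
                  spatial: Sequence[float]) -> List[float]:
    """ASSUMPTION: fuse the three cues by simple concatenation."""
    return list(visual) + list(label) + list(spatial)


def score_predicates(fused: Sequence[float],
                     weights: Sequence[Sequence[float]]) -> List[float]:
    """Linear score per predicate; each row of `weights` is one predicate."""
    return [sum(w * x for w, x in zip(row, fused)) for row in weights]


def predict_relation(subject: str, obj: str,
                     fused: Sequence[float],
                     weights: Sequence[Sequence[float]],
                     predicates: Sequence[str]) -> RelationInstance:
    """Pick the highest-scoring predicate for a subject/object pair."""
    scores = score_predicates(fused, weights)
    best = max(range(len(scores)), key=scores.__getitem__)
    return RelationInstance(subject, predicates[best], obj)


# Toy inputs (made up): a 2-d visual feature, 1-d label and spatial features.
fused = fuse_features(visual=[1.0, 0.0], label=[0.5], spatial=[0.2])
predicates = ["ride", "towards"]
weights = [[1.0, 0.0, 0.0, 0.0],   # hypothetical weights for "ride"
           [0.0, 1.0, 0.0, 0.0]]   # hypothetical weights for "towards"
relation = predict_relation("person", "bicycle", fused, weights, predicates)
print(relation)
```

In an end-to-end model such as the one the abstract describes, the feature vectors and the scoring weights would be learned jointly rather than fixed by hand as in this toy example.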
