Mining Inter-Video Proposal Relations for Video Object Detection

Recent studies have shown that, context aggregating information from proposals in different frames can clearly enhance the performance of video object detection. However, these approaches mainly exploit the intra-proposal relation within single video, while ignoring the intra-proposal relation among different videos, which can provide important discriminative cues for recognizing confusing objects. To address the limitation, we propose a novel Inter-Video Proposal Relation module. Based on a concise multi-level triplet selection scheme, this module can learn effective object representations via modeling relations of hard proposals among different videos. Moreover, we design a Hierarchical Video Relation Network (HVR-Net), by integrating intra-video and inter-video proposal relations in a hierarchical fashion. This design can progressively exploit both intra and inter contexts to boost video object detection. We examine our method on the large-scale video object detection benchmark, i.e., ImageNet VID, where HVR-Net achieves the SOTA results. Codes and models are available at https://github.com/youthHan/HVRNet.

[1]  Yi Li,et al.  R-FCN: Object Detection via Region-based Fully Convolutional Networks , 2016, NIPS.

[2]  Xiaogang Wang,et al.  DeepID-Net: Deformable deep convolutional neural networks for object detection , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Zhaoxiang Zhang,et al.  Sequence Level Semantics Aggregation for Video Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[6]  Tao Mei,et al.  Relation Distillation Networks for Video Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[7]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[8]  Yong Jae Lee,et al.  Video Object Detection with an Aligned Spatial-Temporal Memory , 2017, ECCV.

[9]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[10]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[11]  Ross B. Girshick,et al.  Focal Loss for Dense Object Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Xiaogang Wang,et al.  Object Detection in Videos with Tubelet Proposal Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Xiaogang Wang,et al.  T-CNN: Tubelets With Convolutional Neural Networks for Object Detection From Videos , 2016, IEEE Transactions on Circuits and Systems for Video Technology.

[15]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[17]  Xiaogang Wang,et al.  DeepID-Net: multi-stage and deformable deep convolutional neural networks for object detection , 2014, ArXiv.

[18]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19]  Jie Gu,et al.  Progressive Sparse Local Attention for Video Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[20]  Yujie Wang,et al.  Flow-Guided Feature Aggregation for Video Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[21]  Yichen Wei,et al.  Relation Networks for Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22]  Bin Yang,et al.  CRAFT Objects from Images , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Kai Chen,et al.  Optimizing Video Object Detection via a Scale-Time Lattice , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[24]  Zongpu Zhang,et al.  Object Guided External Memory Network for Video Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[25]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[26]  Jianbo Shi,et al.  Object Detection in Video with Spatiotemporal Sampling Networks , 2018, ECCV.

[27]  Hei Law,et al.  CornerNet: Detecting Objects as Paired Keypoints , 2018, ECCV.

[28]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Andrew Zisserman,et al.  Detect to Track and Track to Detect , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[30]  Qi Tian,et al.  CenterNet: Keypoint Triplets for Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[31]  Xingyi Zhou,et al.  Bottom-Up Object Detection by Grouping Extreme and Center Points , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Shuicheng Yan,et al.  Seq-NMS for Video Object Detection , 2016, ArXiv.

[33]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[34]  Peng Gao,et al.  Video Object Detection with Locally-Weighted Deformable Neighbors , 2019, AAAI.

[35]  Wei Liu,et al.  Leveraging Long-Range Temporal Relationships Between Proposals for Video Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[36]  Zhidong Deng,et al.  Fully Motion-Aware Network for Video Object Detection , 2018, ECCV.

[37]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[38]  Yichen Wei,et al.  Towards High Performance Video Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Marios Savvides,et al.  Feature Selective Anchor-Free Module for Single-Shot Object Detection , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.