Paper information - Feature Flow: In-network Feature Flow Estimation for Video Object Detection


Optical flow, which expresses pixel displacement, is widely used in many computer vision tasks to provide pixel-level motion information. However, with the remarkable progress of convolutional neural networks, recent state-of-the-art approaches solve problems directly at the feature level. Since the displacement of feature vectors is not consistent with pixel displacement, a common approach is to forward optical flow to a neural network and fine-tune this network on the task dataset. With this method, the fine-tuned network is expected to produce tensors encoding feature-level motion information. In this paper, we rethink this de facto paradigm and analyze its drawbacks in the video object detection task. To mitigate these issues, we propose a novel network (IFF-Net) with an In-network Feature Flow estimation module (IFF module) for video object detection. Without resorting to pre-training on any additional dataset, our IFF module is able to directly produce feature flow, which indicates the feature displacement. Our IFF module is a shallow module that shares features with the detection branches. This compact design enables our IFF-Net to detect objects accurately while maintaining a fast inference speed. Furthermore, we propose a transformation residual loss (TRL) based on self-supervision, which further improves the performance of our IFF-Net. Our IFF-Net outperforms existing methods and achieves state-of-the-art performance on ImageNet VID.
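To make the notion of feature displacement concrete, the following is a minimal sketch of warping a single feature channel with a per-location flow field using bilinear sampling. It is an illustration only, not the IFF module or the TRL from the paper: the function names (`bilinear_sample`, `warp`, `mean_abs_residual`), the `(dy, dx)` flow convention, and the zero padding outside the map are all assumptions made for this example, and the residual shown is a generic alignment error rather than the authors' loss.

```python
import math


def bilinear_sample(feat, y, x):
    """Sample a 2-D feature channel (list of rows) at a real-valued (y, x).

    Locations outside the map contribute zero (assumed zero padding).
    """
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    # Blend the four neighbouring cells with bilinear weights.
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                value += wy * wx * feat[yy][xx]
    return value


def warp(feat, flow):
    """Warp feat so that output[y][x] = feat[y + dy][x + dx].

    flow[y][x] is a hypothetical (dy, dx) displacement for that location.
    """
    h, w = len(feat), len(feat[0])
    return [
        [bilinear_sample(feat, y + flow[y][x][0], x + flow[y][x][1])
         for x in range(w)]
        for y in range(h)
    ]


def mean_abs_residual(warped, target):
    """Generic alignment error: mean absolute difference of two maps.

    This is NOT the paper's transformation residual loss, only a stand-in
    showing how warped features can be compared without extra labels.
    """
    h, w = len(target), len(target[0])
    total = sum(abs(warped[y][x] - target[y][x])
                for y in range(h) for x in range(w))
    return total / (h * w)


# Toy example: shift every location one cell to the right.
feat = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
flow = [[(0, 1)] * 3 for _ in range(3)]
warped = warp(feat, flow)
```

In this toy case each output cell takes the value of its right-hand neighbour, and the last column falls outside the map and becomes zero. A fractional displacement such as `(0, 0.5)` would blend two neighbouring cells instead.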
