Online Video Object Detection via Local and Mid-Range Feature Propagation

This work proposes a Local and Mid-range feature Propagation (LMP) method for video object detection that captures feature correlations while reducing redundant computation. The LMP model contains two modules, each with its own propagation scheme. The local module propagates motion and appearance context over the short term; it is lightweight and does not use local attention, which greatly reduces redundant computation. The mid-range module, based on the non-local attention mechanism, explores feature correlations over a longer term by capturing relatively longer-range relationships. Combining the two modules lets LMP enrich feature representations while keeping computation fast. Evaluated on the ImageNet VID dataset, LMP achieves 64.2% mAP at 28.5 FPS on desktop GPUs, which is state-of-the-art performance among one-stage MobileNet-based detectors. Source code is available at https://github.com/ktw361/Local-Mid-Propagation.
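
To illustrate the kind of non-local attention the mid-range module builds on, the sketch below aggregates features from an earlier frame into the current frame. It is a minimal NumPy illustration and not the authors' implementation: the function name `non_local_aggregate`, the flattened (positions, channels) layout, the scaled dot-product similarity, and the absence of learned projection layers are all assumptions made for brevity.

```python
import numpy as np

def non_local_aggregate(current, memory):
    """Enrich current-frame features with features from an earlier frame.

    Hypothetical helper for illustration only.
    current: (N, C) array, N spatial positions of the current frame.
    memory:  (M, C) array, M spatial positions of an earlier frame.
    Returns an (N, C) array: current plus attention-weighted memory.
    """
    channels = current.shape[1]
    # Pairwise similarity between every current and every memory position.
    scores = current @ memory.T / np.sqrt(channels)
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    # Residual connection keeps the original current-frame features.
    return current + weights @ memory

rng = np.random.default_rng(0)
current = rng.standard_normal((6, 4))   # 6 positions, 4 channels (assumed sizes)
memory = rng.standard_normal((10, 4))   # 10 positions from an earlier frame
enriched = non_local_aggregate(current, memory)
print(enriched.shape)
```

Every current-frame position attends to every memory position, so the cost grows with N times M. That quadratic cost is one plausible reason to reserve attention for the mid-range module and keep the local module attention-free, as the abstract describes.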
