Fully Motion-Aware Network for Video Object Detection

Video objection detection is challenging in the presence of appearance deterioration in certain video frames. One of typical solutions is to enhance per-frame features through aggregating neighboring frames. But the features of objects are usually not spatially calibrated across frames due to motion from object and camera. In this paper, we propose an end-to-end model called fully motion-aware network (MANet), which jointly calibrates the features of objects on both pixel-level and instance-level in a unified framework. The pixel-level calibration is flexible in modeling detailed motion while the instance-level calibration captures more global motion cues in order to be robust to occlusion. To our best knowledge, MANet is the first work that can jointly train the two modules and dynamically combine them according to the motion patterns. It achieves leading performance on the large-scale ImageNet VID dataset.

[1]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[5]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[7]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[8]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[9]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[10]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Stefan Roth,et al.  MirrorFlow: Exploiting Symmetries in Joint Optical Flow and Occlusion Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[12]  Andrew Zisserman,et al.  Detect to Track and Track to Detect , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[13]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Yujie Wang,et al.  Flow-Guided Feature Aggregation for Video Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[15]  Xiaogang Wang,et al.  Object Detection from Video Tubelets with Convolutional Neural Networks , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Abhinav Gupta,et al.  Training Region-Based Object Detectors with Online Hard Example Mining , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[19]  Xiaogang Wang,et al.  Object Detection in Videos with Tubelet Proposal Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Xiaogang Wang,et al.  T-CNN: Tubelets With Convolutional Neural Networks for Object Detection From Videos , 2016, IEEE Transactions on Circuits and Systems for Video Technology.

[22]  Yi Li,et al.  Deformable Convolutional Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[23]  Yichen Wei,et al.  Deep Feature Flow for Video Recognition , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[25]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[26]  Kaiming He,et al.  Focal Loss for Dense Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[27]  Xiaogang Wang,et al.  Gated Bi-directional CNN for Object Detection , 2016, ECCV.

[28]  Shuicheng Yan,et al.  Seq-NMS for Video Object Detection , 2016, ArXiv.

[29]  Shuicheng Yan,et al.  Dual Path Networks , 2017, NIPS.

[30]  Yi Li,et al.  R-FCN: Object Detection via Region-based Fully Convolutional Networks , 2016, NIPS.

[31]  Minh N. Do,et al.  Fast Guided Global Interpolation for Depth and Motion , 2016, ECCV.