A feature temporal attention based interleaved network for fast video object detection

Object detection in videos is a fundamental technology for applications such as monitoring. Because static detectors treat video frames as independent input images, they ignore the temporal information of objects and perform redundant computation during detection. In this paper, based on the spatiotemporal continuity of video objects, we propose an attention-guided dynamic video object detection method for fast detection. We define two frame attributes, key frame and non-key frame, and extract complete features for the former and only shallow features for the latter. Unlike the fixed key frame strategy used in previous studies, our new key frame decision method measures the feature similarity between frames to adaptively determine the attribute of the current frame. For non-key frames, the designed temporal attention based feature propagation module (TAFPM) applies semantic enhancement and feature temporal attention (FTA) based feature propagation to the extracted shallow features to generate high-level semantic features. Our method is evaluated on the ImageNet VID dataset. It runs at 21.53 fps, twice the speed of the base detector R-FCN, with an mAP decline of only 0.2% compared to R-FCN. Overall, the proposed method achieves performance comparable to state-of-the-art methods that focus on speed.
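
The two ideas above, adaptive key frame decision and attention-based feature propagation, can be illustrated with a minimal sketch. It is not the implementation used in this work. Cosine similarity, the threshold of 0.9, and scaled dot-product attention are assumptions chosen as stand-ins, and the function names `assign_frame_attributes` and `attention_propagate` are hypothetical. Features are plain lists of floats so that the example runs with the Python standard library only.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two flattened feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_frame_attributes(
    shallow_feats: Sequence[Sequence[float]],
    threshold: float = 0.9,  # assumed value, not taken from the paper
) -> List[str]:
    """Label each frame "key" or "non-key".

    The first frame is always a key frame. A later frame becomes a key
    frame when its shallow feature is less similar than `threshold` to
    the shallow feature of the most recent key frame.
    """
    labels: List[str] = []
    key_feat = None
    for feat in shallow_feats:
        if key_feat is None or cosine_similarity(feat, key_feat) < threshold:
            labels.append("key")
            key_feat = feat
        else:
            labels.append("non-key")
    return labels


def attention_propagate(
    query: Sequence[float],
    key_shallow: Sequence[Sequence[float]],
    key_deep: Sequence[Sequence[float]],
) -> List[float]:
    """Propagate deep key frame features to one non-key frame position.

    Scaled dot-product attention (an assumed stand-in for FTA): the
    non-key shallow feature `query` is compared with the key frame
    shallow features, and the resulting softmax weights average the
    key frame deep features.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale for key in key_shallow
    ]
    # Subtract the maximum score for a numerically stable softmax.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(key_deep[0])
    return [
        sum(w * value[d] for w, value in zip(weights, key_deep))
        for d in range(dim)
    ]


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99]]
    print(assign_frame_attributes(frames))
    print(attention_propagate([1.0, 0.0], [[10.0, 0.0], [0.0, 10.0]], [[1.0], [0.0]]))
```

In this sketch the expensive full feature extraction would run only on frames labeled "key", while frames labeled "non-key" reuse key frame features through the attention step. A lower threshold yields fewer key frames and therefore less computation, at the likely cost of accuracy.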
