Detection and tracking based tubelet generation for video object detection

Abstract: Video object detection (VID) is more challenging than still-image object detection: it requires not only detecting objects accurately in each frame but also tracking them over long periods of time. To detect objects in videos, we propose a Detection And Tracking (DAT) based tubelet generation framework. Under this framework, we first propose a detection-based tubelet generation method, which generates tubelets with more accurate bounding boxes than traditional tracking-based methods. Tracking-based methods, on the other hand, generally achieve a higher recall of bounding boxes than the detection-based method. To exploit these complementary strengths, we further propose a novel tubelet fusion method that combines the two sources of multi-modal information (appearance information in independent images and contextual information in videos). Extensive experiments on the ILSVRC 2016 VID dataset show that the proposed method achieves state-of-the-art performance.
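To make the notion of a detection-based tubelet concrete, the following is a minimal sketch of one common way to link per-frame detections into tubelets: greedy matching by intersection-over-union (IoU) between consecutive frames. It is not the method of this paper, whose details the abstract does not give. The names `iou` and `link_detections`, the box format `(x1, y1, x2, y2)`, and the threshold of 0.5 are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)
Tubelet = List[Tuple[int, Box]]          # sequence of (frame index, box)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_detections(frames: List[List[Box]],
                    iou_threshold: float = 0.5) -> List[Tubelet]:
    """Greedily link per-frame boxes into tubelets (illustrative only).

    A tubelet is extended when a box in the current frame overlaps its
    last box by at least `iou_threshold`; otherwise it is closed.
    Boxes left unmatched in a frame start new tubelets.
    """
    finished: List[Tubelet] = []
    active: List[Tubelet] = []
    for t, boxes in enumerate(frames):
        unused = list(boxes)
        still_active: List[Tubelet] = []
        for tubelet in active:
            last_box = tubelet[-1][1]
            # Pick the unused box that overlaps this tubelet's last box most.
            best = max(unused, key=lambda b: iou(last_box, b), default=None)
            if best is not None and iou(last_box, best) >= iou_threshold:
                tubelet.append((t, best))
                unused.remove(best)
                still_active.append(tubelet)
            else:
                finished.append(tubelet)
        # Every box left unmatched opens a new tubelet at frame t.
        for b in unused:
            still_active.append([(t, b)])
        active = still_active
    return finished + active


# Hypothetical detections: one object drifting slightly, then a distant one.
frames = [[(0, 0, 10, 10)], [(1, 1, 11, 11)], [(50, 50, 60, 60)]]
tubelets = link_detections(frames)
```

In this toy input the first two boxes overlap strongly and form one tubelet, while the distant box in the third frame starts a second one. A tracking-based generator would instead propagate a box forward from an initial detection, which is the kind of complementary signal the abstract proposes to fuse.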
