A Simple Learning Framework for Large Vocabulary Video Object Detection

Applying deep learning in the video domain fundamentally suffers from the data-hungry problem, and the situation becomes even more severe for complex and challenging tasks. One promising direction, we believe, is to leverage already well-curated large-scale image data to complement the insufficient video data. However, jointly training on multiple datasets [14] with both image and video labels leads to several issues, detailed below. In this paper, we investigate the new problem of large vocabulary tracking, one of the essential milestones toward AI agents that understand a dynamic world. The task naturally lacks training labels, as the data collection and annotation procedure is extremely expensive. As a remedy, leveraging large-scale image data is an attractive solution [3]. However, in doing so, we face three main issues: 1) images lack video supervision, 2) semantic labels are inconsistent between image and video datasets, and 3) there is a domain gap (e.g., different explicit data styles or implicit data distributions) between images and videos. The current learning paradigm bypasses the first two issues by training the detection head on images and the tracking head on videos independently (decoupled). Instead, our learning framework explicitly handles the first two issues by hallucinating the supervision and enables end-to-end video model learning from all training data, leading
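The core idea of hallucinating video supervision from static images can be illustrated with a minimal pure-Python sketch: from a single annotated image, a pseudo "next frame" is generated by jittering each box, and matching track IDs are assigned so a tracking loss can be computed on image-only data. All names here (`hallucinate_pair`, the box format, the shift range) are hypothetical illustrations, not the paper's actual implementation.

```python
import random

def hallucinate_pair(boxes, max_shift=8, seed=0):
    """Simulate video supervision from one static image.

    boxes: list of (x, y, w, h) annotations for a single image.
    Returns two pseudo frames plus track IDs: each box appears in
    both frames with the same ID, forming a positive pair for a
    tracking/association loss; box size is preserved, only the
    position is jittered to mimic object motion between frames.
    """
    rng = random.Random(seed)
    frame_t, frame_t1, track_ids = [], [], []
    for tid, (x, y, w, h) in enumerate(boxes):
        dx = rng.uniform(-max_shift, max_shift)
        dy = rng.uniform(-max_shift, max_shift)
        frame_t.append((x, y, w, h))            # original "frame t"
        frame_t1.append((x + dx, y + dy, w, h))  # jittered "frame t+1"
        track_ids.append(tid)                    # shared ID -> positive pair
    return frame_t, frame_t1, track_ids

# Two annotated boxes from a static image become a pseudo video pair.
boxes = [(10, 20, 50, 40), (100, 60, 30, 30)]
f0, f1, ids = hallucinate_pair(boxes)
```

With such pseudo pairs, the tracking head can be trained jointly with the detection head on image data, rather than being trained only on the scarce video annotations.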

[1] Philipp Krähenbühl, et al. Probabilistic Two-Stage Detection, 2021, arXiv.

[2] Philipp Krähenbühl, et al. Simple Multi-dataset Detection, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Ross B. Girshick, et al. Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details, 2021, arXiv.

[4] Gang Zhang, et al. Equalization Loss v2: A New Gradient Balance Approach for Long-tailed Object Detection, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5] Kai Chen, et al. Seesaw Loss for Long-Tailed Instance Segmentation, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Trevor Darrell, et al. Quasi-Dense Similarity Learning for Multiple Object Tracking, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Deva Ramanan, et al. TAO: A Large-Scale Benchmark for Tracking Any Object, 2020, ECCV.

[8] Kai Chen, et al. MMDetection: Open MMLab Detection Toolbox and Benchmark, 2019, arXiv.

[9] Ross B. Girshick, et al. LVIS: A Dataset for Large Vocabulary Instance Segmentation, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10] Yuchen Fan, et al. Video Instance Segmentation, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[11] Ross B. Girshick, et al. Mask R-CNN, 2017, arXiv:1703.06870.

[12] Serge J. Belongie, et al. Feature Pyramid Networks for Object Detection, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.