In Defense of Online Models for Video Instance Segmentation

In recent years, video instance segmentation (VIS) has been largely advanced by offline models, while online models have gradually attracted less attention, possibly due to their inferior performance. However, online methods have an inherent advantage in handling long video sequences and ongoing videos, where offline models fail because of limited computational resources. It would therefore be highly desirable for online models to achieve comparable or even better performance than offline models. By dissecting current online and offline models, we demonstrate that the main cause of the performance gap is error-prone association between frames, caused by the similar appearance of different instances in the feature space. Based on this observation, we propose an online framework based on contrastive learning that learns more discriminative instance embeddings for association and fully exploits history information for stability. Despite its simplicity, our method outperforms all online and offline methods on three benchmarks. Specifically, we achieve 49.5 AP on YouTube-VIS 2019, a significant improvement of 13.2 AP and 2.1 AP over the prior online and offline art, respectively. Moreover, we achieve 30.2 AP on OVIS, a more challenging dataset with significant crowding and occlusions, surpassing the prior art by 14.8 AP. The proposed method won first place in the video instance segmentation track of the 4th Large-scale Video Object Segmentation Challenge (CVPR 2022). We hope the simplicity and effectiveness of our method, as well as our insight into current methods, can shed light on the exploration of VIS models. The code is available at https://github.com/wjf5203/VNext.
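To make the idea of contrastive instance embeddings more concrete, the following is a minimal sketch, not the released implementation. It assumes a generic InfoNCE-style loss with temperature 1 and a plain dot-product similarity. The names `contrastive_loss`, `associate`, and the `memory` dictionary of per-instance embeddings are hypothetical and chosen only for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(query, positive, negatives):
    """InfoNCE-style loss (assumed form, temperature 1).

    Equals -log(exp(q.k+) / (exp(q.k+) + sum_j exp(q.k_j-))),
    rewritten as log(1 + sum_j exp(q.k_j- - q.k+)).
    """
    pos = dot(query, positive)
    return math.log1p(sum(math.exp(dot(query, neg) - pos) for neg in negatives))


def associate(query, memory):
    """Return the id of the stored instance most similar to `query`.

    `memory` is a hypothetical dict {instance_id: embedding} standing in
    for whatever history the tracker keeps from earlier frames.
    """
    return max(memory, key=lambda inst_id: dot(query, memory[inst_id]))


if __name__ == "__main__":
    q = [1.0, 0.0]
    # A negative that looks like the query yields a larger loss than a distinct one.
    print(contrastive_loss(q, [1.0, 0.0], [[0.9, 0.1]]))
    print(contrastive_loss(q, [1.0, 0.0], [[-1.0, 0.0]]))
    print(associate(q, {"a": [0.9, 0.1], "b": [-1.0, 0.0]}))
```

Under these assumptions, the loss is small when the embeddings of different instances are far apart and grows when a negative resembles the query, which is the failure mode the abstract attributes to online association.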
