OMG-Seg: Is One Model Good Enough For All Segmentation?

In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all of these segmentation tasks, including image semantic, instance, and panoptic segmentation, their video counterparts, open-vocabulary settings, prompt-driven interactive segmentation as in SAM, and video object segmentation. To our knowledge, this is the first single model to handle all of these tasks with satisfactory performance. We show that OMG-Seg, a transformer-based encoder-decoder architecture with task-specific queries and outputs, can support over ten distinct segmentation tasks while significantly reducing computational and parameter overhead across tasks and datasets. We rigorously evaluate the inter-task influences and correlations during co-training. Code and models are available at https://github.com/lxtGH/OMG-Seg.
