SOLQ: Segmenting Objects by Learning Queries

In this paper, we propose an end-to-end framework for instance segmentation. Based on the recently introduced DETR [1], our method, termed SOLQ, segments objects by learning unified queries. In SOLQ, each query represents one object and has multiple representations: class, location and mask. The object queries learned perform classification, box regression and mask encoding simultaneously in an unified vector form. During training phase, the mask vectors encoded are supervised by the compression coding of raw spatial masks. In inference time, mask vectors produced can be directly transformed to spatial masks by the inverse process of compression coding. Experimental results show that SOLQ can achieve state-of-the-art performance, surpassing most of existing approaches. Moreover, the joint learning of unified query representation can greatly improve the detection performance of original DETR. We hope our SOLQ can serve as a strong baseline for the Transformer-based instance segmentation. Code is available at https: //github.com/megvii-research/SOLQ.

[1]  Wenyu Liu,et al.  Boundary-preserving Mask R-CNN , 2020, ECCV.

[2]  Chunhua Shen,et al.  BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Yi Li,et al.  Fully Convolutional Instance-Aware Semantic Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Nuno Vasconcelos,et al.  Cascade R-CNN: Delving Into High Quality Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5]  Hsueh-Ming Hang,et al.  Exploring Semantic Segmentation on the DCT Representation , 2019, MMAsia.

[6]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[7]  Luc Van Gool,et al.  Semantic Instance Segmentation with a Discriminative Loss Function , 2017, ArXiv.

[8]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[9]  Kai Chen,et al.  Hybrid Task Cascade for Instance Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Zhiao Huang,et al.  Associative Embedding: End-to-End Learning for Joint Detection and Grouping , 2016, NIPS.

[11]  S. Gelly,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.

[12]  Yongchao Gong,et al.  Mask Scoring R-CNN , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Ross B. Girshick,et al.  Boundary IoU: Improving Object-Centric Image Segmentation Evaluation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Bin Li,et al.  Deformable DETR: Deformable Transformers for End-to-End Object Detection , 2020, ICLR.

[16]  Xiang Li,et al.  Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[17]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Wen Gao,et al.  Pre-Trained Image Processing Transformer , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Ross B. Girshick,et al.  PointRend: Image Segmentation As Rendering , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Yuning Jiang,et al.  SOLO: Segmenting Objects by Locations , 2019, ECCV.

[21]  Ping Luo,et al.  PolarMask: Single Shot Instance Segmentation With Polar Representation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Yunchao Wei,et al.  CCNet: Criss-Cross Attention for Semantic Segmentation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Qi Tian,et al.  CenterNet: Keypoint Triplets for Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[24]  Shu Liu,et al.  Path Aggregation Network for Instance Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Chunhua Shen,et al.  Mask Encoding for Single Shot Instance Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Yong Jae Lee,et al.  YOLACT: Real-Time Instance Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[27]  Yi Jiang,et al.  Sparse R-CNN: End-to-End Object Detection with Learnable Proposals , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[29]  Ross B. Girshick,et al.  Focal Loss for Dense Object Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Kaiming He,et al.  Focal Loss for Dense Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[31]  Yunchao Wei,et al.  CCNet: Criss-Cross Attention for Semantic Segmentation. , 2020, IEEE transactions on pattern analysis and machine intelligence.

[32]  Ling Shao,et al.  ISTR: End-to-End Instance Segmentation with Transformers , 2021, ArXiv.

[33]  Graham W. Taylor,et al.  Deconvolutional networks , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[34]  Jianqin Zhou,et al.  On discrete cosine transform , 2011, ArXiv.

[35]  Hao Chen,et al.  Conditional Convolutions for Instance Segmentation , 2020, ECCV.

[36]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[37]  Silvio Savarese,et al.  Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Heng Tao Shen,et al.  Principal Component Analysis , 2009, Encyclopedia of Biometrics.

[39]  Nicolas Usunier,et al.  End-to-End Object Detection with Transformers , 2020, ECCV.

[40]  Ming Yang,et al.  SSAP: Single-Shot Instance Segmentation With Affinity Pyramid , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[41]  Bing Deng,et al.  DCT-Mask: Discrete Cosine Transform Mask Representation for Instance Segmentation , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  CenterMask: Real-Time Anchor-Free Instance Segmentation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Deep Snake for Real-Time Instance Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Luc Van Gool,et al.  Dynamic Filter Networks , 2016, NIPS.

[45]  Matthieu Cord,et al.  Training data-efficient image transformers & distillation through attention , 2020, ICML.

[46]  René Vidal,et al.  Sparse Dictionaries for Semantic Segmentation , 2014, ECCV.

[47]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[48]  Harold W. Kuhn,et al.  The Hungarian method for the assignment problem , 1955, 50 Years of Integer Programming.

[49]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Learning in the Frequency Domain , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Stephen Lin,et al.  Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[52]  S. Frick,et al.  Compressed Sensing , 2014, Computer Vision, A Reference Guide.

[53]  Hao Chen,et al.  FCOS: Fully Convolutional One-Stage Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[54]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[55]  Sanja Fidler,et al.  SGN: Sequential Grouping Networks for Instance Segmentation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[56]  Tao Kong,et al.  SOLOv2: Dynamic, Faster and Stronger , 2020, ArXiv.

[57]  Xinlei Chen,et al.  TensorMask: A Foundation for Dense Object Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).