Contrastive Grouping with Transformer for Referring Image Segmentation

Referring image segmentation aims to segment the target referent in an image conditioning on a natural language expression. Existing one-stage methods employ per-pixel classification frameworks, which attempt straightforwardly to align vision and language at the pixel level, thus failing to capture critical object-level information. In this paper, we propose a mask classification framework, Contrastive Grouping with Transformer network (CGFormer), which explicitly captures object-level information via token-based querying and grouping strategy. Specifically, CGFormer first introduces learnable query tokens to represent objects and then alternately queries linguistic features and groups visual features into the query tokens for object-aware cross-modal reasoning. In addition, CGFormer achieves cross-level interaction by jointly updating the query tokens and decoding masks in every two consecutive layers. Finally, CGFormer cooperates contrastive learning to the grouping strategy to identify the token and its mask corresponding to the referent. Experimental results demonstrate that CG-Former outperforms state-of-the-art methods in both segmentation and generalization settings consistently and significantly. Code is available at https://github.com/ Toneyaya/CGFormer .

[1]  Suha Kwak,et al.  ReSTR: Convolution-free Referring Image Segmentation Using Transformers , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Shalini De Mello,et al.  GroupViT: Semantic Segmentation Emerges from Text Supervision , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Alexander G. Schwing,et al.  Mask2Former for Video Instance Segmentation , 2021, ArXiv.

[4]  S. Bai,et al.  SeqFormer: Sequential Transformer for Video Instance Segmentation , 2021, ECCV.

[5]  Philip H. S. Torr,et al.  LAVT: Language-Aware Vision Transformer for Referring Image Segmentation , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Tongliang Liu,et al.  CRIS: CLIP-Driven Referring Image Segmentation , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Yu-Gang Jiang,et al.  Two-stage Visual Cues Enhancement Network for Referring Image Segmentation , 2021, ACM Multimedia.

[8]  Xudong Jiang,et al.  Vision-Language Transformer and Query Generation for Referring Segmentation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[9]  Alexander G. Schwing,et al.  Per-Pixel Classification is Not All You Need for Semantic Segmentation , 2021, NeurIPS.

[10]  Yizhou Yu,et al.  Bottom-Up Shift and Reasoning for Referring Image Segmentation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Cordelia Schmid,et al.  Segmenter: Transformer for Semantic Segmentation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[12]  Huchuan Lu,et al.  Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Tieniu Tan,et al.  Locate then Segment: A Strong Pipeline for Referring Image Segmentation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Ilya Sutskever,et al.  Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.

[15]  Tao Xiang,et al.  Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Shih-Fu Chang,et al.  Open-Vocabulary Object Detection Using Captions , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Xiaoshuai Sun,et al.  Cascade Grouped Attention Network for Referring Expression Segmentation , 2020, ACM Multimedia.

[18]  Bin Li,et al.  Deformable DETR: Deformable Transformers for End-to-End Object Detection , 2020, ICLR.

[19]  Guanbin Li,et al.  Linguistic Structure Guided Context Modeling for Referring Image Segmentation , 2020, ECCV.

[20]  Yunchao Wei,et al.  Referring Image Segmentation via Cross-Modal Progressive Comprehension , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Subhransu Maji,et al.  PhraseCut: Language-Based Image Segmentation in the Wild , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Huchuan Lu,et al.  Bi-Directional Relationship Inferring Network for Referring Image Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Nicolas Usunier,et al.  End-to-End Object Detection with Transformers , 2020, ECCV.

[24]  Yizhou Yu,et al.  Graph-Structured Referring Expression Reasoning in the Wild , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Liujuan Cao,et al.  Multi-Task Collaborative Network for Joint Referring Expression Comprehension and Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Lu Yuan,et al.  Dynamic Convolution: Attention Over Convolution Kernels , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Fei Wu,et al.  Dice Loss for Data-imbalanced NLP Tasks , 2019, ACL.

[28]  Ming-Hsuan Yang,et al.  Referring Expression Object Segmentation with Caption-Aware Consistency , 2019, BMVC.

[29]  Lysandre Debut,et al.  HuggingFace's Transformers: State-of-the-art Natural Language Processing , 2019, ArXiv.

[30]  Hwann-Tzong Chen,et al.  See-Through-Text Grouping for Referring Image Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[31]  Yizhou Yu,et al.  Dynamic Graph Attention for Referring Expression Comprehension , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[32]  Kan Chen,et al.  Zero-Shot Grounding of Objects From Natural Language Queries , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[33]  Yizhou Yu,et al.  Relationship-Embedded Representation Learning for Grounding Referring Expressions , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Yizhou Yu,et al.  Cross-Modal Relationship Inference for Grounding Referring Expressions , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Yang Wang,et al.  Cross-Modal Self-Attention Network for Referring Image Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Yuan-Fang Wang,et al.  Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Jun Fu,et al.  Dual Attention Network for Scene Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Hongliang Li,et al.  Key-Word-Aware Network for Referring Expression Image Segmentation , 2018, ECCV.

[39]  Pablo Arbeláez,et al.  Dynamic Multimodal Instance Segmentation guided by natural language queries , 2018, ECCV.

[40]  Xiaojuan Qi,et al.  Referring Image Segmentation via Recurrent Refinement Networks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[41]  Licheng Yu,et al.  MAttNet: Modular Attention Network for Referring Expression Comprehension , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[42]  Xiaodong Liu,et al.  Language-Based Image Editing with Recurrent Attentive Models , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[43]  Frank Hutter,et al.  Decoupled Weight Decay Regularization , 2017, ICLR.

[44]  George Papandreou,et al.  Rethinking Atrous Convolution for Semantic Image Segmentation , 2017, ArXiv.

[45]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[46]  Chenxi Liu,et al.  Recurrent Multimodal Interaction for Referring Image Segmentation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[47]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[48]  Xiangyu Zhang,et al.  Large Kernel Matters — Improve Semantic Segmentation by Global Convolutional Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Serge J. Belongie,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Yi Li,et al.  Fully Convolutional Instance-Aware Semantic Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Ben Poole,et al.  Categorical Reparameterization with Gumbel-Softmax , 2016, ICLR.

[52]  Yee Whye Teh,et al.  The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables , 2016, ICLR.

[53]  Larry S. Davis,et al.  Modeling Context Between Objects for Referring Expression Understanding , 2016, ECCV.

[54]  Licheng Yu,et al.  Modeling Context in Referring Expressions , 2016, ECCV.

[55]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[56]  Trevor Darrell,et al.  Segmentation from Natural Language Expressions , 2016, ECCV.

[57]  Alan L. Yuille,et al.  Generation and Comprehension of Unambiguous Object Descriptions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Dit-Yan Yeung,et al.  Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting , 2015, NIPS.

[59]  Trevor Darrell,et al.  Fully convolutional networks for semantic segmentation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Vicente Ordonez,et al.  ReferItGame: Referring to Objects in Photographs of Natural Scenes , 2014, EMNLP.

[61]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[62]  Hugo Jair Escalante,et al.  The segmented and annotated IAPR TC-12 benchmark , 2010, Comput. Vis. Image Underst..

[63]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[64]  Sibei Yang,et al.  Spatial and Visual Perspective-Taking via View Rotation and Relation Reasoning for Embodied Reference Understanding , 2023, ECCV.

[65]  Maxwell D. Collins,et al.  k-means Mask Transformer , 2022, ECCV.

[66]  Liang Lin,et al.  Structured Attention Network for Referring Image Segmentation , 2022, IEEE Transactions on Multimedia.

[67]  Stephen Lin,et al.  Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[68]  Yizhou Yu,et al.  Propagating Over Phrase Relations for One-Stage Visual Grounding , 2020, ECCV.

[69]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[70]  George W. Bohrnstedt,et al.  OF RANDOM VARIABLES , 2016 .

[71]  B. Ripley,et al.  Pattern Recognition , 1968, Nature.