Vision-Language Transformer and Query Generation for Referring Segmentation

In this work, we address the challenging task of referring segmentation. The query expression in referring segmentation typically indicates the target object by describing its relationships with other objects; therefore, to find the target among all instances in the image, the model needs a holistic understanding of the whole scene. To this end, we reformulate referring segmentation as a direct attention problem: finding the region of the image to which the query language expression attends most. We introduce transformers and multi-head attention to build a network with an encoder-decoder attention architecture that "queries" the given image with the language expression. Furthermore, we propose a Query Generation Module that produces multiple sets of queries with different attention weights, representing diverse comprehensions of the language expression from different aspects. To select among these diverse comprehensions based on visual cues, we further propose a Query Balance Module that adaptively weights the output features of these queries for better mask generation. Without bells and whistles, our approach is lightweight and consistently achieves new state-of-the-art performance on three referring segmentation datasets: RefCOCO, RefCOCO+, and G-Ref. Our code is available at https://github.com/henghuiding/Vision-Language-Transformer.
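The pipeline sketched in the abstract can be illustrated with a minimal numpy mock-up: each of several queries re-weights the word features differently (Query Generation), each query then attends over the vision features in a decoder step, and a balance score fuses the per-query outputs before mask prediction (Query Balance). All shapes, the pooled visual context, and the projection `W_q` are illustrative assumptions, not the paper's actual implementation; learned layers are replaced by random matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: queries attend over key/value features.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
C, L, N, Nq = 64, 7, 16 * 16, 8  # channels, words, pixels, number of queries

words  = rng.normal(size=(L, C))   # language features, one vector per word
pixels = rng.normal(size=(N, C))   # flattened vision features

# Query Generation: each of the Nq queries attends to the words with its own
# weights, conditioned on a pooled visual context (W_q is a stand-in for a
# learned projection).
vis_ctx = pixels.mean(axis=0)                  # (C,) pooled visual context
W_q = rng.normal(size=(Nq, C))
word_logits = (W_q * vis_ctx) @ words.T        # (Nq, L) per-query word weights
queries = softmax(word_logits) @ words         # (Nq, C) query vectors

# Decoder step: each generated query "queries" the image features.
decoded = attention(queries, pixels, pixels)   # (Nq, C)

# Query Balance: weight each query's output by its compatibility with the
# visual context, then fuse into a single feature for mask prediction.
balance = softmax(decoded @ vis_ctx)           # (Nq,) sums to 1
fused = balance @ decoded                      # (C,) fused feature

# Mask logits: similarity of every pixel feature to the fused feature.
mask_logits = pixels @ fused                   # (N,)
print(mask_logits.shape)
```

This toy version drops multi-head splitting, layer norms, and the FPN-style mask decoder, but it shows the control flow: diverse language comprehensions are generated as queries, and visual evidence decides how much each comprehension contributes.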
