WiCo: Win-win Cooperation of Bottom-up and Top-down Referring Image Segmentation

The top-down and bottom-up methods are two mainstreams of referring segmentation, while both methods have their own intrinsic weaknesses. Top-down methods are chiefly disturbed by Polar Negative (PN) errors owing to the lack of fine-grained cross-modal alignment. Bottom-up methods are mainly perturbed by Inferior Positive (IP) errors due to the lack of prior object information. Nevertheless, we discover that two types of methods are highly complementary for restraining respective weaknesses but the direct average combination leads to harmful interference. In this context, we build Win-win Cooperation (WiCo) to exploit complementary nature of two types of methods on both interaction and integration aspects for achieving a win-win improvement. For the interaction aspect, Complementary Feature Interaction (CFI) provides fine-grained information to top-down branch and introduces prior object information to bottom-up branch for complementary feature enhancement. For the integration aspect, Gaussian Scoring Integration (GSI) models the gaussian performance distributions of two branches and weightedly integrates results by sampling confident scores from the distributions. With our WiCo, several prominent top-down and bottom-up combinations achieve remarkable improvements on three common datasets with reasonable extra costs, which justifies effectiveness and generality of our method.

[1]  Zi Xuan Zhang,et al.  CoupAlign: Coupling Word-Pixel with Sentence-Mask Alignments for Referring Image Segmentation , 2022, Neural Information Processing Systems.

[2]  Suha Kwak,et al.  ReSTR: Convolution-free Referring Image Segmentation Using Transformers , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Yunhang Shen,et al.  SeqTR: A Simple yet Universal Network for Visual Grounding , 2022, ECCV.

[4]  Philip H. S. Torr,et al.  LAVT: Language-Aware Vision Transformer for Referring Image Segmentation , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  A. Schwing,et al.  Masked-attention Mask Transformer for Universal Image Segmentation , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Tongliang Liu,et al.  CRIS: CLIP-Driven Referring Image Segmentation , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Leonid Sigal,et al.  Referring Transformer: A One-step Approach to Multi-task Visual Grounding , 2021, NeurIPS.

[8]  Yizhou Yu,et al.  Bottom-Up Shift and Reasoning for Referring Image Segmentation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Huchuan Lu,et al.  Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Tieniu Tan,et al.  Locate then Segment: A Strong Pipeline for Referring Image Segmentation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  S. Gelly,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.

[12]  Guanbin Li,et al.  Linguistic Structure Guided Context Modeling for Referring Image Segmentation , 2020, ECCV.

[13]  Yunchao Wei,et al.  Referring Image Segmentation via Cross-Modal Progressive Comprehension , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Liujuan Cao,et al.  Multi-Task Collaborative Network for Joint Referring Expression Comprehension and Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Ming-Hsuan Yang,et al.  Referring Expression Object Segmentation with Caption-Aware Consistency , 2019, BMVC.

[16]  Hwann-Tzong Chen,et al.  See-Through-Text Grouping for Referring Image Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[17]  Yang Wang,et al.  Cross-Modal Self-Attention Network for Referring Image Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Hanwang Zhang,et al.  Learning to Assemble Neural Module Tree Networks for Visual Grounding , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[19]  Xiaojuan Qi,et al.  Referring Image Segmentation via Recurrent Refinement Networks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20]  Licheng Yu,et al.  MAttNet: Modular Attention Network for Referring Expression Comprehension , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Frank Hutter,et al.  Decoupled Weight Decay Regularization , 2017, ICLR.

[22]  Oriol Vinyals,et al.  Neural Discrete Representation Learning , 2017, NIPS.

[23]  Licheng Yu,et al.  Modeling Context in Referring Expressions , 2016, ECCV.

[24]  Trevor Darrell,et al.  Segmentation from Natural Language Expressions , 2016, ECCV.

[25]  Trevor Darrell,et al.  Natural Language Object Retrieval , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Alan L. Yuille,et al.  Generation and Comprehension of Unambiguous Object Descriptions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[28]  Michelle X. Zhou,et al.  Proceedings of the 28th ACM International Conference on Multimedia , 2009, MM 2009.

[29]  E. Parzen On Estimation of a Probability Density Function and Mode , 1962 .

[30]  Richard A. Davis,et al.  Remarks on Some Nonparametric Estimates of a Density Function , 2011 .

[31]  Luc Van Gool,et al.  European conference on computer vision (ECCV) , 2006, eccv 2006.

[32]  Michael I. Jordan,et al.  Advances in Neural Information Processing Systems 30 , 1995 .

[33]  E. Parzen Annals of Mathematical Statistics , 1962 .