Meta Compositional Referring Expression Segmentation

Referring expression segmentation aims to segment the object described by a language expression from an image. Despite recent progress, existing models may not fully capture the semantics and visual representations of individual concepts, which limits their generalization capability, especially when handling novel compositions of learned concepts. In this work, through the lens of meta learning, we propose a Meta Compositional Referring Expression Segmentation (MCRES) framework to enhance the compositional generalization performance of the model. Specifically, to handle various levels of novel compositions, our framework first uses the training data to construct a virtual training set and multiple virtual testing sets, where the data samples in each virtual testing set contain a given level of novel compositions w.r.t. the virtual training set. Then, by following a novel meta optimization scheme that optimizes the model to obtain good testing performance on the virtual testing sets after training on the virtual training set, our framework can effectively drive the model to better capture the semantics and visual representations of individual concepts, and thus obtain robust generalization performance even when handling novel compositions. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our framework.
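The two stages described above (building virtual sets, then optimizing for virtual testing performance after virtual training) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: expressions are reduced to tuples of concept words, a "level" of novelty is taken to be the number of concept pairs unseen in the virtual training set, and the update is a generic first-order meta-learning step on a scalar parameter rather than the proposed meta optimization scheme. The names `compositions`, `build_virtual_sets`, and `meta_step` are hypothetical.

```python
from itertools import combinations


def compositions(concepts):
    """Return all unordered concept pairs in one expression (assumed notion of 'composition')."""
    return {frozenset(pair) for pair in combinations(sorted(concepts), 2)}


def build_virtual_sets(samples, n_train):
    """Split samples into a virtual training set and several virtual testing sets.

    A held-out sample is kept only if all of its concepts appear in the virtual
    training set, so that only its compositions are novel. It is bucketed by the
    number of novel concept pairs (a hypothetical stand-in for 'level').
    """
    v_train = samples[:n_train]
    seen_concepts = set().union(*(set(s) for s in v_train))
    seen_pairs = set().union(*(compositions(s) for s in v_train))
    v_tests = {}
    for s in samples[n_train:]:
        if not set(s) <= seen_concepts:
            continue  # contains an unseen concept, not merely a novel composition
        level = len(compositions(s) - seen_pairs)
        if level > 0:
            v_tests.setdefault(level, []).append(s)
    return v_train, v_tests


def meta_step(theta, grad_train, grad_tests, inner_lr=0.1, outer_lr=0.1):
    """One first-order meta update on a scalar parameter (assumed, generic scheme).

    grad_train and each entry of grad_tests map a parameter value to the
    gradient of the corresponding loss.
    """
    # Virtual training: take one gradient step on the virtual training loss.
    theta_fast = theta - inner_lr * grad_train(theta)
    # Virtual testing: average the gradients of the virtual testing losses,
    # evaluated at the adapted parameter.
    g_test = sum(g(theta_fast) for g in grad_tests) / len(grad_tests)
    # Meta update: combine the training gradient with the post-adaptation
    # testing gradient (first-order approximation, no second derivatives).
    return theta - outer_lr * (grad_train(theta) + g_test)


samples = [
    ("red", "car"),
    ("blue", "ball"),
    ("red", "ball"),         # both words seen, pairing is new
    ("green", "car"),        # "green" is unseen, so this sample is skipped
    ("red", "blue", "car"),  # two of its three pairs are new
]
v_train, v_tests = build_virtual_sets(samples, n_train=2)
theta_new = meta_step(0.0, lambda t: 2 * (t - 1), [lambda t: 2 * (t - 3)])
```

In this toy setup the held-out samples fall into one bucket per novelty count, and the meta update moves the parameter toward values that also lower the virtual testing losses after the virtual training step. A real model would replace the scalar parameter and quadratic losses with network weights and a segmentation loss.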
