Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding

The prevailing framework for referring expression grounding is a two-stage process: 1) detecting proposals with an object detector and 2) grounding the referent to one of the proposals. Existing two-stage solutions mostly focus on the grounding step, which aims to align the expression with the proposals. In this paper, we argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages: they generate proposals based solely on detection confidence (i.e., expression-agnostic), hoping that the proposals contain all the right instances mentioned in the expression (i.e., expression-aware). Due to this mismatch, current two-stage methods suffer a severe performance drop when ground-truth proposals are replaced with detected ones. To this end, we propose Ref-NMS, the first method to yield expression-aware proposals at the first stage. Ref-NMS regards all nouns in the expression as critical objects and introduces a lightweight module that predicts a score for how well each box aligns with a critical object. These scores guide the NMS operation to filter out boxes irrelevant to the expression, increasing the recall of critical objects and significantly improving grounding performance. Since Ref-NMS is agnostic to the grounding step, it can be easily integrated into any state-of-the-art two-stage method. Extensive ablation studies on several backbones, benchmarks, and tasks consistently demonstrate the superiority of Ref-NMS.
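
The idea of letting an expression-relatedness score guide NMS can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the predicted score enters NMS, so the sketch assumes it is multiplied with the detection confidence before a standard greedy NMS. The names `iou`, `greedy_nms`, and `expression_aware_scores`, and all box coordinates and score values, are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(
    boxes: Sequence[Box],
    scores: Sequence[float],
    iou_threshold: float = 0.5,
    top_k: Optional[int] = None,
) -> List[int]:
    """Standard greedy NMS: visit boxes by descending score and keep a box
    only if it does not overlap an already kept box above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if top_k is not None and len(keep) >= top_k:
                break
    return keep


def expression_aware_scores(
    det_conf: Sequence[float], relatedness: Sequence[float]
) -> List[float]:
    """ASSUMPTION: fuse the two scores by a simple product."""
    return [d * r for d, r in zip(det_conf, relatedness)]


# Hypothetical toy data: boxes 0 and 1 overlap heavily; boxes 1 and 3 are
# the ones assumed to match nouns in the expression.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (40, 40, 50, 50)]
det_conf = [0.9, 0.8, 0.7, 0.3]       # expression-agnostic detector confidence
relatedness = [0.1, 0.9, 0.05, 0.95]  # made-up output of a relatedness module

agnostic_keep = greedy_nms(boxes, det_conf, top_k=2)
aware_keep = greedy_nms(
    boxes, expression_aware_scores(det_conf, relatedness), top_k=2
)
print(agnostic_keep, aware_keep)
```

On this toy input, confidence-only NMS keeps boxes 0 and 2 and drops both expression-relevant boxes, while the fused scores keep boxes 1 and 3. That is the recall effect the abstract describes, shown here under the assumed product fusion.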
