A Better Loss for Visual-Textual Grounding

Given a textual phrase and an image, visual grounding is the task of locating the image content referenced by the phrase. It is a challenging task with several real-world applications in human-computer interaction, image-text reference resolution, and video-text reference resolution. In recent years, several works have addressed this problem with large and complex models that try to capture visual-textual dependencies better than before. These models typically consist of two main components: one learns useful multi-modal features for grounding, and the other refines the predicted bounding box of the visual mention. Finding the right learning balance between these two sub-tasks is not easy, and current models are not necessarily optimal in this respect. In this work, we propose a model that, although it uses a simple multi-modal feature fusion component, achieves higher accuracy than state-of-the-art models thanks to a more effective loss function, based on the class probabilities, which reaches a better learning balance between the two sub-tasks on the considered datasets.
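To make the two-sub-task balance concrete, the sketch below shows one plausible way a class-probability-based loss could couple proposal classification with IoU-based box regression for a single phrase. This is an illustrative assumption, not the paper's actual loss: the names grounding_loss and box_iou, the single-phrase interface, and the probability-based weighting scheme are all hypothetical.

```python
import torch
import torch.nn.functional as F

def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) tensors."""
    x1 = torch.max(box_a[0], box_b[0])
    y1 = torch.max(box_a[1], box_b[1])
    x2 = torch.min(box_a[2], box_b[2])
    y2 = torch.min(box_a[3], box_b[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def grounding_loss(class_logits, pred_boxes, gt_index, gt_box):
    """Hypothetical combined loss for one phrase (not the paper's exact formulation).

    class_logits: (num_proposals,) phrase-proposal matching scores
    pred_boxes:   (num_proposals, 4) refined boxes as (x1, y1, x2, y2)
    gt_index:     index of the proposal matched to the ground truth
    gt_box:       (4,) ground-truth box
    """
    # Sub-task 1 (classification): select the right proposal for the phrase.
    probs = F.softmax(class_logits, dim=0)
    cls_loss = -torch.log(probs[gt_index] + 1e-8)

    # Sub-task 2 (regression): IoU-based loss on the matched proposal's box.
    reg_loss = 1.0 - box_iou(pred_boxes[gt_index], gt_box)

    # Assumed balancing scheme: weight the regression term by the predicted
    # class probability, so box refinement is emphasized once the
    # classification branch is confident about the matched proposal.
    return cls_loss + probs[gt_index].detach() * reg_loss

# Example: 3 proposals, proposal 1 matches the ground truth.
logits = torch.tensor([0.2, 2.0, -1.0])
boxes = torch.tensor([[0., 0., 10., 10.], [2., 2., 8., 8.], [5., 5., 9., 9.]])
loss = grounding_loss(logits, boxes, gt_index=1, gt_box=torch.tensor([1., 1., 9., 9.]))
```

Detaching the probability used as a weight keeps the balancing factor from feeding regression gradients back into the classification branch; whether the actual model does this is an open assumption of the sketch.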
