vtGraphNet: Learning weakly-supervised scene graph for complex visual grounding

Abstract

Visual grounding is a challenging cross-modal task that is usually addressed by directly analyzing the unstructured scene and matching the query text against all region proposals. This approach is prone to errors, especially when the scene, the query text, or both are complex. In this paper, we study this complex visual grounding problem and propose to build a query-dependent visual-textual (VT) scene graph to jointly understand the image and the query text. To avoid the difficulty of obtaining ground-truth scene graphs, we propose vtGraphNet to learn the bi-modal scene graph in a weakly-supervised way, where the only supervision is the manually annotated grounding region. Specifically, we first use an ARU Tagging model to sequentially tag every query word as an attribute, a relationship, or an auxiliary. If a word is tagged as an attribute, an attribute-assigning model associates it with a region proposal. If a word is tagged as a relationship, a relationship-referring model associates it with a pair of region proposals. A simple yet effective graph consistency loss function constrains these associations to form a feasible, compact VT scene graph, from which discriminative region features can be extracted and used to locate the grounding object by classification. Extensive experiments on benchmark datasets validate the superiority of our approach in handling both simple and complex visual grounding tasks.
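The tagging-then-association steps described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the tags and scores below are hand-written stand-ins for the outputs of the learned ARU Tagging, attribute-assigning, and relationship-referring models, and the names `VTSceneGraph` and `build_vt_graph` are hypothetical. Treating nouns such as "cup" as attributes is also an assumption of this example. The graph consistency loss is not modeled here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class VTSceneGraph:
    # region index -> attribute words attached to that region proposal
    node_attributes: Dict[int, List[str]] = field(default_factory=dict)
    # (subject region, object region) -> relationship words on that edge
    edges: Dict[Tuple[int, int], List[str]] = field(default_factory=dict)


def build_vt_graph(words, tags, attr_scores, rel_scores):
    """Attach each tagged query word to its best-scoring region or region pair."""
    graph = VTSceneGraph()
    for word, tag in zip(words, tags):
        if tag == "attribute":
            scores = attr_scores[word]  # one score per region proposal
            region = max(range(len(scores)), key=scores.__getitem__)
            graph.node_attributes.setdefault(region, []).append(word)
        elif tag == "relationship":
            pair_scores = rel_scores[word]  # (subject, object) -> score
            pair = max(pair_scores, key=pair_scores.get)
            graph.edges.setdefault(pair, []).append(word)
        # auxiliary words contribute nothing to the graph
    return graph


# Toy query over three region proposals (all values are made up).
words = ["the", "red", "cup", "on", "the", "table"]
tags = ["auxiliary", "attribute", "attribute",
        "relationship", "auxiliary", "attribute"]
attr_scores = {
    "red": [0.1, 0.8, 0.1],
    "cup": [0.2, 0.7, 0.1],
    "table": [0.1, 0.1, 0.8],
}
rel_scores = {"on": {(1, 2): 0.9, (2, 1): 0.05, (0, 2): 0.05}}

graph = build_vt_graph(words, tags, attr_scores, rel_scores)
print(graph.node_attributes)
print(graph.edges)
```

In this toy case, region 1 collects the words "red" and "cup", region 2 collects "table", and the edge from region 1 to region 2 carries "on". A hard argmax is used only to keep the example short; a trainable model would more plausibly keep soft association scores.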
