Using Syntax to Ground Referring Expressions in Natural Images

We introduce GroundNet, a neural network for referring expression recognition -- the task of localizing (or grounding) in an image the object referred to by a natural language expression. Our approach to this task is the first to rely on a syntactic analysis of the input referring expression in order to inform the structure of the computation graph. Given a parse tree for an input expression, we explicitly map the syntactic constituents and relationships present in the tree to a composed graph of neural modules that defines our architecture for performing localization. This syntax-based approach aids localization of both the target object and auxiliary supporting objects mentioned in the expression. As a result, GroundNet is more interpretable than previous methods: we can (1) determine which phrase of the referring expression points to which object in the image and (2) track how the localization of the target object is determined by the network. We study this property empirically by introducing a new set of annotations on the GoogleRef dataset to evaluate localization of supporting objects. Our experiments show that GroundNet achieves state-of-the-art accuracy in identifying supporting objects, while maintaining comparable performance in the localization of target objects.
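
The parse-tree-to-module-graph idea can be illustrated with a minimal sketch. The module names (localize, relate), the toy phrase embedding, and the dot-product scoring below are assumptions for illustration only, not the paper's learned modules; the sketch only shows how recursively traversing a parse tree composes one operation per constituent and yields a separate attention map over image regions for each phrase.

```python
# Illustrative sketch (not the authors' implementation) of composing
# neural-module-style operations along a parse tree. All scoring
# functions here are toy stand-ins for learned modules over CNN
# region features.

from dataclasses import dataclass, field
from typing import List
import numpy as np


@dataclass
class ParseNode:
    phrase: str                                   # e.g. "the mug", "on the table"
    children: List["ParseNode"] = field(default_factory=list)


def embed(text: str, dim: int) -> np.ndarray:
    """Fixed pseudo-random stand-in for a learned phrase embedding."""
    seed = int.from_bytes(text.encode("utf-8"), "little") % (2**32)
    return np.random.default_rng(seed).normal(size=dim)


def localize(phrase: str, region_feats: np.ndarray) -> np.ndarray:
    """Score each candidate region against a phrase (toy dot-product module)."""
    scores = region_feats @ embed(phrase, region_feats.shape[1])
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()                # soft attention over regions


def relate(head_attn: np.ndarray, child_attn: np.ndarray) -> np.ndarray:
    """Re-weight a head phrase's attention using a supporting object's attention."""
    combined = head_attn * (0.5 + 0.5 * child_attn)
    return combined / combined.sum()


def ground(node: ParseNode, region_feats: np.ndarray) -> np.ndarray:
    """Recursively compose modules following the parse tree structure."""
    attn = localize(node.phrase, region_feats)
    for child in node.children:
        attn = relate(attn, ground(child, region_feats))  # each subtree grounds a supporting object
    return attn


if __name__ == "__main__":
    regions = np.random.default_rng(0).normal(size=(5, 8))   # 5 candidate regions, 8-d features
    tree = ParseNode("the mug", children=[ParseNode("on the table")])
    print(ground(tree, regions))                              # attention over regions for the target
```

Because every parse node produces its own attention map, the per-phrase outputs are exactly the intermediate quantities one would inspect to check which phrase grounded which object, which is the interpretability property the abstract describes.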
