Scene Graph Parsing as Dependency Parsing

In this paper, we study the problem of parsing structured knowledge graphs from textual descriptions. In particular, we consider the scene graph representation that considers objects together with their attributes and relations: this representation has been proved useful across a variety of vision and language applications. We begin by introducing an alternative but equivalent edge-centric view of scene graphs that connect to dependency parses. Together with a careful redesign of label and action space, we combine the two-stage pipeline used in prior work (generic dependency parsing followed by simple post-processing) into one, enabling end-to-end training. The scene graphs generated by our learned neural dependency parser achieve an F-score similarity of 49.67% to ground truth graphs on our evaluation set, surpassing best previous approaches by 5%. We further demonstrate the effectiveness of our learned parser on image retrieval applications.

[1]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[2]  James Cross,et al.  Incremental Parsing with Minimal Features Using Bi-Directional LSTM , 2016, ACL.

[3]  Joakim Nivre,et al.  Incrementality in Deterministic Dependency Parsing , 2004 .

[4]  Yejin Choi,et al.  Neural AMR: Sequence-to-Sequence Models for Parsing and Generation , 2017, ACL.

[5]  Noah A. Smith,et al.  Transition-Based Dependency Parsing with Stack Long Short-Term Memory , 2015, ACL.

[6]  Anton van den Hengel,et al.  Graph-Structured Representations for Visual Question Answering , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Timothy Dozat,et al.  Deep Biaffine Attention for Neural Dependency Parsing , 2016, ICLR.

[8]  Danqi Chen,et al.  A Fast and Accurate Dependency Parser using Neural Networks , 2014, EMNLP.

[9]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[10]  Noah A. Smith,et al.  Dependency Parsing , 2009, Encyclopedia of Artificial Intelligence.

[11]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[12]  Trevor Darrell,et al.  Natural Language Object Retrieval , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Chenxi Liu,et al.  Recurrent Multimodal Interaction for Referring Image Segmentation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[14]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[15]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[16]  Bo Dai,et al.  Detecting Visual Relationships with Deep Relational Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[18]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2015, CVPR.

[19]  Christopher D. Manning,et al.  Robust Subgraph Generation Improves Abstract Meaning Representation Parsing , 2015, ACL.

[20]  Jayant Krishnamurthy,et al.  Jointly Learning to Parse and Perceive: Connecting Natural Language to the Physical World , 2013, TACL.

[21]  Angel X. Chang,et al.  Learning Spatial Knowledge for Text to 3D Scene Generation , 2014, EMNLP.

[22]  Philipp Koehn,et al.  Synthesis Lectures on Human Language Technologies , 2016 .

[23]  Chuan Wang,et al.  A Transition-based Algorithm for AMR Parsing , 2015, NAACL.

[24]  Xiaogang Wang,et al.  Scene Graph Generation from Objects, Phrases and Caption Regions , 2017, ArXiv.

[25]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[26]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Lukasz Kaiser,et al.  One Model To Learn Them All , 2017, ArXiv.

[28]  Alon Lavie,et al.  Meteor Universal: Language Specific Translation Evaluation for Any Target Language , 2014, WMT@ACL.

[29]  Phil Blunsom,et al.  Robust Incremental Neural Semantic Graph Parsing , 2017, ACL.

[30]  Michael S. Bernstein,et al.  Visual7W: Grounded Question Answering in Images , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Claire Cardie,et al.  Going out on a limb: Joint Extraction of Entity Mentions and Relations without Dependency Trees , 2017, ACL.

[32]  Lillian Lee,et al.  Fast(er) Exact Decoding and Global Training for Transition-Based Dependency Parsing via a Minimal Feature Set , 2017, EMNLP.

[33]  Chenxi Liu,et al.  Attention Correctness in Neural Image Captioning , 2016, AAAI.

[34]  Sanja Fidler,et al.  Order-Embeddings of Images and Language , 2015, ICLR.

[35]  Ari Rappoport,et al.  A Transition-Based Directed Acyclic Graph Parser for UCCA , 2017, ACL.

[36]  Michael S. Bernstein,et al.  Image retrieval using scene graphs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Alan L. Yuille,et al.  Generation and Comprehension of Unambiguous Object Descriptions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[39]  Jaime G. Carbonell,et al.  A Discriminative Graph-Based Parser for the Abstract Meaning Representation , 2014, ACL.

[40]  Michael S. Bernstein,et al.  Visual Relationship Detection with Language Priors , 2016, ECCV.

[41]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[42]  Dan Klein,et al.  Neural Module Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Li Fei-Fei,et al.  Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval , 2015, VL@EMNLP.

[44]  Wei Xu,et al.  Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) , 2014, ICLR.

[45]  Eliyahu Kiperwasser,et al.  Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations , 2016, TACL.

[46]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[47]  Danfei Xu,et al.  Scene Graph Generation by Iterative Message Passing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Basura Fernando,et al.  SPICE: Semantic Propositional Image Caption Evaluation , 2016, ECCV.