Scene Graph Modification Based on Natural Language Commands

Structured representations such as graphs and parse trees play a crucial role in many natural language processing systems. In recent years, advances in multi-turn user interfaces have created a need to control and update these structured representations as new information arrives. Although much effort has gone into improving parsers that map text to graphs or parse trees, very little work has explored directly manipulating these representations. In this paper, we study the novel problem of graph modification, where a system must learn to update an existing scene graph given a user's command. Our models, based on a graph-based sparse transformer and cross-attention information fusion, outperform previous systems adapted from the machine translation and graph generation literature. We further contribute our large graph modification datasets to the research community to encourage future work on this new problem.

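To make the cross-attention fusion idea concrete, the following is a minimal sketch, assuming PyTorch; the module, names, and dimensions are illustrative assumptions rather than the paper's actual implementation. Each scene-graph node representation queries the encoded command tokens, and the attended summary is fused back into the node embedding before decoding the modified graph.

```python
# Illustrative sketch only (hypothetical names), assuming PyTorch is available.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Fuse scene-graph node states with command token states:
    each node attends over the command tokens, and the attended
    summary is combined with the original node embedding."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, node_states, token_states):
        # node_states:  (batch, num_nodes,  dim) -- from a graph encoder
        # token_states: (batch, num_tokens, dim) -- from a text encoder
        attended, _ = self.attn(query=node_states,
                                key=token_states,
                                value=token_states)
        # Concatenate each node with its command-conditioned summary,
        # then project back to the model dimension.
        return self.proj(torch.cat([node_states, attended], dim=-1))


if __name__ == "__main__":
    fusion = CrossAttentionFusion(dim=128, heads=4)
    nodes = torch.randn(2, 5, 128)    # e.g. 5 scene-graph nodes
    tokens = torch.randn(2, 12, 128)  # e.g. a 12-token user command
    fused = fusion(nodes, tokens)
    print(fused.shape)                # torch.Size([2, 5, 128])
```

In this sketch the fused node states would then feed a graph decoder that predicts node and edge edits; that decoding step is not shown here.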