What is Learned in Visually Grounded Neural Syntax Acquisition

Visual features are a promising signal for bootstrapping textual models. However, black-box learning models make it difficult to isolate the specific contribution of visual components. In this analysis, we consider the case study of the Visually Grounded Neural Syntax Learner (Shi et al., 2019), a recent approach for learning syntax from a visual training signal. By constructing simplified versions of the model, we isolate the core factors that yield the model's strong performance. Contrary to what the model might be capable of learning, we find that significantly less expressive versions produce similar predictions and perform just as well, or even better. We also find that a simple lexical signal of noun concreteness, rather than more complex syntactic reasoning, plays the main role in the model's predictions.

[1] Slav Petrov, et al. A Universal Part-of-Speech Tagset, 2011, LREC.

[2] Chris Dyer, et al. A Critical Analysis of Biased Parsers in Unsupervised Parsing, 2019, ArXiv.

[3] John Langford, et al. Mapping Instructions and Visual Observations to Actions with Reinforcement Learning, 2017, EMNLP.

[4] Amy Beth Warriner, et al. Concreteness ratings for 40 thousand generally known English word lemmas, 2014, Behavior research methods.

[5] Bart de Boer, et al. The Atoms of Language: The Mind's Hidden Rules of Grammar; Foundations of Language: Brain, Meaning, Grammar, Evolution, 2002, Artificial Life.

[6] Dhruv Batra, et al. C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset, 2017, ArXiv.

[7] Alexander M. Rush, et al. Unsupervised Recurrent Neural Network Grammars, 2019, NAACL.

[8] Dan Klein, et al. Constituency Parsing with a Self-Attentive Encoder, 2018, ACL.

[9] Ruslan Salakhutdinov, et al. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014, ArXiv.

[10] Ashish Vaswani, et al. Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation, 2019, ACL.

[11] Armand Joulin, et al. Cooperative Learning of Disjoint Syntax and Semantics, 2019, NAACL.

[12] Mohit Yadav, et al. Unsupervised Latent Tree Induction with Deep Inside-Outside Recursive Auto-Encoders, 2019, NAACL.

[13] Trevor Darrell, et al. Learning to Reason: End-to-End Module Networks for Visual Question Answering, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[14] Aaron C. Courville, et al. Neural Language Modeling by Jointly Learning Syntax and Lexicon, 2017, ICLR.

[15] Yash Goyal, et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16] David M. Mimno, et al. Quantifying the Visual Concreteness of Words and Topics in Multimodal Datasets, 2018, NAACL.

[17] Samuel R. Bowman, et al. Do latent tree learning models identify meaningful structure in sentences?, 2017, TACL.

[18] Ross A. Knepper, et al. Learning to Map Natural Language Instructions to Physical Quadcopter Control using Simulated Flight, 2019, CoRL.

[19] Mark C. Baker. The atoms of language: The mind's hidden rules of grammar, 1987.

[20] Andrew Bennett, et al. Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction, 2018, EMNLP.

[21] Licheng Yu, et al. Modeling Context in Referring Expressions, 2016, ECCV.

[22] Ross A. Knepper, et al. Following High-level Navigation Instructions on a Simulated Quadcopter with Imitation Learning, 2018, Robotics: Science and Systems.

[23] Dhruv Batra, et al. Analyzing the Behavior of Visual Question Answering Models, 2016, EMNLP.

[24] Samuel R. Bowman, et al. Grammar Induction with Neural Language Models: An Unusual Replication, 2018, EMNLP.

[25] Alan L. Yuille, et al. Generation and Comprehension of Unambiguous Object Descriptions, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26] C. Constantinidis, et al. Bottom-Up and Top-Down Attention, 2014, The Neuroscientist: a review journal bringing neurobiology, neurology and psychiatry.

[27] Michael Collins, et al. A New Statistical Parser Based on Bigram Lexical Dependencies, 1996, ACL.

[28] Rada Mihalcea, et al. Structured Matching for Phrase Localization, 2016, ECCV.

[29] Aaron C. Courville, et al. Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks, 2018, ICLR.

[30] Kevin Gimpel, et al. Visually Grounded Neural Syntax Acquisition, 2019, ACL.

[31] Trevor Darrell, et al. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, 2016, EMNLP.

[32] Louis-Philippe Morency, et al. Visual Referring Expression Recognition: What Do Systems Actually Learn?, 2018, NAACL.

[33] Yair Neuman, et al. Literal and Metaphorical Sense Identification through Concrete and Abstract Context, 2011, EMNLP.