Neural Abstructions: Abstractions that Support Construction for Grounded Language Learning

Although virtual agents are increasingly situated in environments where natural language is the most effective mode of interaction with humans, these exchanges are rarely used as an opportunity for learning. Leveraging language interactions effectively requires addressing limitations in the two most common approaches to language grounding: semantic parsers built on top of fixed object categories are precise but inflexible, while end-to-end models are maximally expressive but fickle and opaque. Our goal is to develop a system that balances the strengths of each approach so that users can teach agents new instructions that generalize broadly from a single example. We introduce the idea of neural abstructions: a set of constraints on the inference procedure of a label-conditioned generative model that can affect the meaning of the label in context. Starting from a core programming language that operates over abstructions, users can define increasingly complex mappings from natural language to actions. We show that, with this method, a user population is able to build a semantic parser for an open-ended house modification task in Minecraft. The resulting semantic parser is both flexible and expressive: the percentage of utterances sourced from redefinitions increases steadily over the course of 191 total exchanges, reaching a final value of 28%.
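To make the interaction pattern concrete, below is a minimal, hypothetical sketch (not the paper's implementation) of the kind of loop the abstract describes: a small core command language operates over abstructions (constraints on a generative model's inference), and users extend it by redefining new utterances as compositions of known ones. All names here (`Abstruction`, `CORE`, `REDEFINITIONS`, the block palette, the bounding-box encoding) are illustrative assumptions.

```python
# Hypothetical sketch of the abstract's interaction loop; names and data
# structures are assumptions made for illustration, not the paper's code.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Action = Tuple[int, int, int, str]  # (x, y, z, block_type) placement


@dataclass
class Abstruction:
    """Constraints on the inference procedure of a label-conditioned
    generative model, e.g. a region and block palette the sampler must respect."""
    label: str
    region: Tuple[int, int, int, int, int, int]  # (x, y, z, dx, dy, dz) box
    palette: List[str]                           # block types the model may place

    def generate(self) -> List[Action]:
        # Placeholder for constrained sampling from a generative model:
        # here we simply tile the region's floor with the first palette entry.
        x, y, z, dx, dy, dz = self.region
        return [(x + i, y, z + j, self.palette[0])
                for i in range(dx) for j in range(dz)]


# Core language: primitive commands that operate over abstructions.
CORE: Dict[str, Callable[[str], List[Action]]] = {
    "build": lambda label: Abstruction(label, (0, 64, 0, 3, 1, 3), ["planks"]).generate(),
}

# User-taught redefinitions: new utterances expressed via known ones.
REDEFINITIONS: Dict[str, Callable[[], List[Action]]] = {}


def interpret(utterance: str) -> List[Action]:
    """Map an utterance to actions, preferring user redefinitions over the core."""
    if utterance in REDEFINITIONS:
        return REDEFINITIONS[utterance]()
    verb, *rest = utterance.split()
    if verb in CORE:
        return CORE[verb](" ".join(rest))
    raise KeyError(f"Unknown utterance {utterance!r}: ask the user to define it")


# Teaching from a single example: "add a patio" is defined once in terms of "build",
# and from then on counts as an utterance sourced from a redefinition.
REDEFINITIONS["add a patio"] = lambda: interpret("build patio")

print(interpret("add a patio")[:2])
```

In this toy version, the share of utterances handled by `REDEFINITIONS` rather than `CORE` plays the role of the redefinition percentage reported in the abstract.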
