Generalization in Instruction Following Systems

Understanding and executing natural language instructions in a grounded domain is a hallmark of artificial intelligence. In this paper, we focus on instruction understanding in the blocks world domain and investigate the language understanding abilities of two top-performing systems for the task. We aim to determine whether the test performance of these models reflects a genuine understanding of the spatial domain and of the natural language instructions grounded in it, or whether they merely overfit spurious signals in the dataset. We formulate a set of expectations one might have of an instruction-following model and concretely characterize the dimensions of robustness such a model should possess. Despite decent test performance, we find that state-of-the-art models fall short of these expectations and are extremely brittle. We then propose a learning strategy based on data augmentation and show through extensive experiments that it yields models that remain competitive on the original test set while satisfying our expectations far better.
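
The sketch below illustrates the kind of spatial data augmentation such a strategy might involve: a blocks-world scene and its instruction are mirrored along one axis so that coordinates and directional terms stay consistent. The data format, function names, and the specific mirroring transformation are illustrative assumptions, not the exact augmentation procedure proposed in the paper.

```python
# Hypothetical sketch of a spatial data-augmentation step for blocks-world
# instruction following. The example schema (blocks, instruction, target)
# and the mirroring transformation are assumptions for illustration only.
from __future__ import annotations

import re

# Directional terms that must be swapped when the scene is mirrored
# along the x-axis.
MIRROR_SWAPS = {"left": "right", "right": "left"}


def mirror_instruction(instruction: str) -> str:
    """Swap 'left'/'right' in the instruction text."""
    def swap(match: re.Match) -> str:
        return MIRROR_SWAPS[match.group(0).lower()]
    return re.sub(r"\b(left|right)\b", swap, instruction, flags=re.IGNORECASE)


def mirror_world(blocks: list[dict]) -> list[dict]:
    """Mirror block coordinates along the x-axis (negate x)."""
    return [{**b, "x": -b["x"]} for b in blocks]


def augment(example: dict) -> dict:
    """Produce a mirrored copy of one (world, instruction, target) example."""
    return {
        "blocks": mirror_world(example["blocks"]),
        "instruction": mirror_instruction(example["instruction"]),
        "target": {**example["target"], "x": -example["target"]["x"]},
    }


if __name__ == "__main__":
    example = {
        "blocks": [{"id": "A", "x": 1.0, "y": 0.0, "z": 2.0}],
        "instruction": "Move block A to the left of block B.",
        "target": {"x": 0.5, "y": 0.0, "z": 2.0},
    }
    # The mirrored example keeps the instruction and the coordinates consistent.
    print(augment(example))
```

Such transformations leave the underlying task unchanged while exposing the model to spatial configurations and phrasings it would not otherwise see; the specific transformations used in the paper may differ.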
