Human Instruction-Following with Deep Reinforcement Learning via Transfer-Learning from Text

Recent work has described neural-network-based agents that are trained with reinforcement learning (RL) to execute language-like commands in simulated worlds, as a step towards an intelligent agent or robot that can be instructed by human users. However, the optimisation of multi-goal motor policies via deep RL from scratch requires many episodes of experience. Consequently, instruction-following with deep RL typically involves language generated from templates (by an environment simulator), which does not reflect the varied or ambiguous expressions of real users. Here, we propose a conceptually simple method for training instruction-following agents with deep RL that are robust to natural human instructions. By applying our method with a state-of-the-art pre-trained text-based language model (BERT), on tasks requiring agents to identify and position everyday objects relative to other objects in a naturalistic 3D simulated room, we demonstrate substantially-above-chance zero-shot transfer from synthetic template commands to natural instructions given by humans. Our approach is a general recipe for training any deep RL-based system to interface with human users, and bridges the gap between two research directions of notable recent success: agent-centric motor behavior and text-based representation learning.

[1]  Demis Hassabis,et al.  Grounded Language Learning in a Simulated 3D World , 2017, ArXiv.

[2]  John Langford,et al.  Mapping Instructions and Visual Observations to Actions with Reinforcement Learning , 2017, EMNLP.

[3]  Jason Weston,et al.  A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.

[4]  Shane Legg,et al.  IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures , 2018, ICML.

[5]  Terry Winograd,et al.  Understanding natural language , 1974 .

[6]  Edward Grefenstette,et al.  RTFM: Generalising to Novel Environment Dynamics via Reading , 2020, ICLR.

[7]  Luke S. Zettlemoyer,et al.  Learning to Parse Natural Language Commands to a Robot Control System , 2012, ISER.

[8]  Felix Hill,et al.  SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation , 2014, CL.

[9]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.

[10]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[11]  Dan Klein,et al.  Speaker-Follower Models for Vision-and-Language Navigation , 2018, NeurIPS.

[12]  Alex Graves,et al.  Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.

[13]  Chelsea Finn,et al.  Language as an Abstraction for Hierarchical Deep Reinforcement Learning , 2019, NeurIPS.

[14]  Wojciech Jaskowski,et al.  ViZDoom: A Doom-based AI research platform for visual reinforcement learning , 2016, 2016 IEEE Conference on Computational Intelligence and Games (CIG).

[15]  Pushmeet Kohli,et al.  Learning to Understand Goal Specifications by Modelling Reward , 2018, ICLR.

[16]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[17]  Mike Schuster,et al.  Japanese and Korean voice search , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[18]  William Chan,et al.  InferLite: Simple Universal Sentence Representations from Natural Language Inference Data , 2018, EMNLP.

[19]  Bhuwan Dhingra,et al.  Combating Adversarial Misspellings with Robust Word Recognition , 2019, ACL.

[20]  Ruslan Salakhutdinov,et al.  Multimodal Transformer for Unaligned Multimodal Language Sequences , 2019, ACL.

[21]  Ruslan Salakhutdinov,et al.  Gated-Attention Architectures for Task-Oriented Language Grounding , 2017, AAAI.

[22]  Yoshua Bengio,et al.  BabyAI: First Steps Towards Grounded Language Learning With a Human In the Loop , 2018, ArXiv.

[23]  Raymond J. Mooney,et al.  Learning to Interpret Natural Language Navigation Instructions from Observations , 2011, Proceedings of the AAAI Conference on Artificial Intelligence.

[24]  Stefan Lee,et al.  ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.

[25]  Wei Xu,et al.  Guided Feature Transformation (GFT): A Neural Language Grounding Module for Embodied Agents , 2018, CoRL.

[26]  Douwe Kiela,et al.  No Training Required: Exploring Random Encoders for Sentence Classification , 2019, ICLR.

[27]  Honglak Lee,et al.  Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning , 2017, ICML.

[28]  Christopher D. Manning,et al.  Learning Language Games through Interaction , 2016, ACL.

[29]  Yuan-Fang Wang,et al.  Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Andrew Bennett,et al.  Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction , 2018, EMNLP.

[31]  Matthew R. Walter,et al.  A framework for learning semantic maps from grounded natural language descriptions , 2014, Int. J. Robotics Res..

[32]  Qi Wu,et al.  Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Sergey Levine,et al.  Residual Reinforcement Learning for Robot Control , 2018, 2019 International Conference on Robotics and Automation (ICRA).

[34]  Omer Levy,et al.  RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.

[35]  Yiming Yang,et al.  Transformer-XL: Attentive Language Models beyond a Fixed-Length Context , 2019, ACL.

[36]  Ilya Sutskever,et al.  Language Models are Unsupervised Multitask Learners , 2019 .

[37]  Nicholas Roy,et al.  A Probabilistic Approach for Enabling Robots to Acquire Information from Human Partners Using Language , 2011 .

[38]  Wei Xu,et al.  Interactive Grounded Language Acquisition and Generalization in a 2D World , 2018, ICLR.

[39]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[40]  Yuandong Tian,et al.  Building Generalizable Agents with a Realistic and Rich 3D Environment , 2018, ICLR.

[41]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[42]  Sanja Fidler,et al.  ACTRCE: Augmenting Experience via Teacher's Advice For Multi-Goal Reinforcement Learning , 2019, ArXiv.

[43]  Matthew R. Walter,et al.  Approaching the Symbol Grounding Problem with Probabilistic Graphical Models , 2011, AI Mag..

[44]  Xiaojun Chang,et al.  Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).