Learning to Understand Goal Specifications by Modelling Reward

Recent work has shown that deep reinforcement-learning agents can learn to follow language-like instructions from infrequent environment rewards. However, this places the onus on environment designers to hand-design language-conditional reward functions, which may not be easy or tractable to implement as the complexity of the environment and of the language scales. To overcome this limitation, we present a framework within which instruction-conditional RL agents are trained using rewards obtained not from the environment, but from reward models that are jointly trained from expert examples. As the reward models improve, they learn to accurately reward agents for completing tasks in environment configurations, and for instructions, not present amongst the expert data. This framework effectively separates the representation of what instructions require from how they can be executed. In a simple grid world, it enables an agent to learn a range of commands requiring interaction with blocks and an understanding of spatial relations and underspecified abstract arrangements. We further show that the method allows our agent to adapt to changes in the environment without requiring new expert examples.
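
The abstract describes the core idea at a high level: the reward signal the agent learns from comes not from the environment but from a jointly trained reward model that judges whether a state satisfies the instruction, with expert (instruction, goal state) examples as positives and states reached by the agent as negatives. The sketch below is one minimal way such a joint training loop could look; it is not the authors' implementation. The module sizes, the random tensors standing in for environment states, and the single REINFORCE-style policy update (the paper trains the policy with an actor-critic method) are all assumptions made purely for illustration.

```python
# Minimal sketch of "reward modelling for instruction following" (illustrative only).
# A reward model is trained to discriminate expert (instruction, goal-state) pairs
# from states visited by the agent; the policy is trained with policy gradients on
# the reward model's output instead of an environment reward.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, EMB, STATE_DIM, N_ACTIONS = 50, 32, 16, 4   # toy sizes (assumptions)

class RewardModel(nn.Module):
    """Scores how well a state satisfies an instruction (a discriminator)."""
    def __init__(self):
        super().__init__()
        self.instr_emb = nn.EmbeddingBag(VOCAB, EMB)  # mean-pooled instruction embedding
        self.net = nn.Sequential(nn.Linear(EMB + STATE_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, instr_tokens, state):
        x = torch.cat([self.instr_emb(instr_tokens), state], dim=-1)
        return self.net(x).squeeze(-1)                # one logit per (instruction, state)

class Policy(nn.Module):
    """Instruction-conditional policy over a small discrete action set."""
    def __init__(self):
        super().__init__()
        self.instr_emb = nn.EmbeddingBag(VOCAB, EMB)
        self.net = nn.Sequential(nn.Linear(EMB + STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

    def forward(self, instr_tokens, state):
        x = torch.cat([self.instr_emb(instr_tokens), state], dim=-1)
        return F.log_softmax(self.net(x), dim=-1)

reward_model, policy = RewardModel(), Policy()
opt_r = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
opt_p = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(100):
    # Stand-ins for real data: instructions, expert goal states for those instructions,
    # and states the current policy actually reaches (here just random tensors).
    instr = torch.randint(0, VOCAB, (8, 5))
    expert_goal_states = torch.randn(8, STATE_DIM)
    agent_states = torch.randn(8, STATE_DIM)

    # 1) Reward-model update: expert goal states are positives, agent-visited states negatives.
    pos = reward_model(instr, expert_goal_states)
    neg = reward_model(instr, agent_states)
    loss_r = (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos))
              + F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg)))
    opt_r.zero_grad()
    loss_r.backward()
    opt_r.step()

    # 2) Policy update: a REINFORCE-style step using the modelled (thresholded) reward.
    log_probs = policy(instr, agent_states)
    actions = torch.multinomial(log_probs.exp(), 1).squeeze(-1)
    with torch.no_grad():
        modelled_reward = (torch.sigmoid(reward_model(instr, agent_states)) > 0.5).float()
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss_p = -(chosen * modelled_reward).mean()
    opt_p.zero_grad()
    loss_p.backward()
    opt_p.step()
```

The point the sketch is meant to make is that the policy never queries an environment reward: once the reward model generalises across instructions and layouts, it can reward the agent for configurations and instructions that have no expert example, which is what lets the representation of what an instruction requires stay separate from how it is executed.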
