Improving Policy Learning via Language Dynamics Distillation

Recent work has shown that augmenting environments with language descriptions improves policy learning. However, for environments with complex language abstractions, learning to ground language to observations is difficult due to sparse, delayed rewards. We propose Language Dynamics Distillation (LDD), which pretrains a model to predict environment dynamics from demonstrations with language descriptions, and then fine-tunes these language-aware pretrained representations via reinforcement learning (RL). In this way, the model is trained to both maximize expected reward and retain knowledge about how language relates to environment dynamics. On SILG, a benchmark of five tasks with language descriptions that evaluate distinct generalization challenges on unseen environments (NetHack, ALFWorld, RTFM, Messenger, and Touchdown), LDD outperforms tabula-rasa RL, VAE pretraining, and methods that learn from unlabeled demonstrations, including inverse RL and reward shaping with pretrained experts. Our analyses show that language descriptions in demonstrations improve sample efficiency and generalization across environments, and that dynamics modeling with expert demonstrations is more effective than with non-experts.
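To make the two-stage recipe concrete, the sketch below shows one way the pretrain-then-distill loop could look in PyTorch. It is a minimal illustration under strong simplifying assumptions, not the authors' released code: DynamicsModel, the flat tensor shapes, the policy-gradient placeholder, and the distillation weight of 0.5 are all hypothetical stand-ins for the paper's actual architecture and IMPALA-based training.

    import copy
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DynamicsModel(nn.Module):
        # Encodes an (observation, language description) pair and predicts
        # the next observation embedding.
        def __init__(self, obs_dim=128, text_dim=64, hidden=256):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(obs_dim + text_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.next_obs_head = nn.Linear(hidden, obs_dim)

        def forward(self, obs, text):
            z = self.encoder(torch.cat([obs, text], dim=-1))
            return self.next_obs_head(z), z

    model = DynamicsModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    # Stage 1: pretrain the dynamics model on expert demonstrations.
    # Random tensors stand in for real (obs, description, next_obs) triples.
    for _ in range(100):
        obs, text, next_obs = torch.randn(32, 128), torch.randn(32, 64), torch.randn(32, 128)
        pred, _ = model(obs, text)
        loss = F.mse_loss(pred, next_obs)
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: reuse the pretrained encoder in the policy and fine-tune with
    # an RL objective plus a distillation term toward a frozen copy of the
    # pretrained model, so language-dynamics knowledge is retained.
    teacher = copy.deepcopy(model).eval()
    policy_head = nn.Linear(256, 8)  # 8 = hypothetical action-space size
    rl_opt = torch.optim.Adam(
        list(model.parameters()) + list(policy_head.parameters()), lr=1e-4)

    obs, text = torch.randn(32, 128), torch.randn(32, 64)
    pred, z = model(obs, text)
    logits = policy_head(z)

    # Placeholder policy-gradient loss; a real agent would compute this from
    # environment rollouts with an actor-critic objective such as IMPALA.
    actions = torch.randint(0, 8, (32, 1))
    advantages = torch.randn(32)
    log_probs = F.log_softmax(logits, dim=-1).gather(1, actions).squeeze(1)
    rl_loss = -(advantages * log_probs).mean()

    with torch.no_grad():
        target, _ = teacher(obs, text)
    distill_loss = F.mse_loss(pred, target)  # anchor to pretrained dynamics

    total = rl_loss + 0.5 * distill_loss  # 0.5 = hypothetical weight
    rl_opt.zero_grad(); total.backward(); rl_opt.step()

The design choice mirrored here is that the encoder is shared between dynamics prediction and the policy, so the distillation term keeps the policy's representations close to the language-aware dynamics knowledge acquired during pretraining while the RL loss maximizes reward.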

[1] Andrew Kyle Lampinen, et al. Semantic Exploration from Language Abstractions and Pretrained Representations, 2022, arXiv.

[2] Luke Zettlemoyer, et al. SILG: The Multi-environment Symbolic Interactive Language Grounding Benchmark, 2021, arXiv.

[3] Karthik Narasimhan, et al. Grounding Language to Entities and Dynamics for Generalization in Reinforcement Learning, 2021, ICML.

[4] Matthew J. Hausknecht, et al. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning, 2020, ICLR.

[5] Jason Baldridge, et al. Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding, 2020, EMNLP.

[6] Edward Grefenstette, et al. The NetHack Learning Environment, 2020, NeurIPS.

[7] Tim Rocktäschel, et al. RIDE: Rewarding Impact-Driven Exploration for Procedurally-Generated Environments, 2020, ICLR.

[8] Luke Zettlemoyer, et al. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks, 2020, CVPR.

[9] Edward Grefenstette, et al. RTFM: Generalising to Novel Environment Dynamics via Reading, 2019, arXiv.

[10] Peter Stone, et al. Stochastic Grounded Action Transformation for Robot Learning in Simulation, 2020, IROS.

[11] Fuchun Sun, et al. Imitation Learning from Observations by Minimizing Inverse Dynamics Disagreement, 2019, NeurIPS.

[12] Edward Grefenstette, et al. TorchBeast: A PyTorch Platform for Distributed RL, 2019, arXiv.

[13] Mo Yu, et al. Hybrid Reinforcement Learning with Expert State Sequences, 2019, AAAI.

[14] Yoav Artzi, et al. TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments, 2019, CVPR.

[15] Amos J. Storkey, et al. Exploration by Random Network Distillation, 2018, ICLR.

[16] Yannick Schroecker, et al. Imitating Latent Policies from Observation, 2018, ICML.

[17] Jürgen Schmidhuber, et al. Recurrent World Models Facilitate Policy Evolution, 2018, NeurIPS.

[18] Derek Chen, et al. Decoupling Strategy and Generation in Negotiation Dialogues, 2018, EMNLP.

[19] Ryuki Tachibana, et al. Internal Model from Observations for Reward Shaping, 2018, arXiv.

[20] Peter Stone, et al. Behavioral Cloning from Observation, 2018, IJCAI.

[21] Percy Liang, et al. Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration, 2018, ICLR.

[22] Shane Legg, et al. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, 2018, ICML.

[23] Qi Wu, et al. Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments, 2018, CVPR.

[24] Yuval Tassa, et al. Learning human behaviors from motion capture by adversarial imitation, 2017, arXiv.

[25] Tom Schaul, et al. Reinforcement Learning with Unsupervised Auxiliary Tasks, 2016, ICLR.

[26] Regina Barzilay, et al. Language Understanding for Text-based Games using Deep Reinforcement Learning, 2015, EMNLP.

[27] Max Welling, et al. Auto-Encoding Variational Bayes, 2013, ICLR.

[28] Michael L. Littman, et al. An analysis of model-based Interval Estimation for Markov Decision Processes, 2008, J. Comput. Syst. Sci.