Distilling Internet-Scale Vision-Language Models into Embodied Agents

Instruction-following agents must ground language into their observation and action spaces. Learning to ground language is challenging, typically requiring domain-specific engineering or large quantities of human interaction data. To address this challenge, we propose using pretrained vision-language models (VLMs) to supervise embodied agents. We combine ideas from model distillation and hindsight experience replay (HER), using a VLM to retroactively generate language describing the agent's behavior. Simple prompting allows us to control the supervision signal, teaching an agent to interact with novel objects based on their names (e.g., planes) or their features (e.g., colors) in a 3D-rendered environment. Few-shot prompting lets us teach abstract category membership, including pre-existing categories (food vs. toys) and ad-hoc ones (arbitrary preferences over objects). Our work outlines a new and effective way to use internet-scale VLMs, repurposing the generic language grounding acquired by such models to teach task-relevant groundings to embodied agents.
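
The core loop can be summarised as hindsight relabelling with a VLM standing in for a human annotator. Below is a minimal sketch in Python, assuming hypothetical `env`, `policy`, and `vlm_describe` interfaces (these names are illustrative, not from the paper): the agent acts, the VLM retroactively describes what the episode actually achieved, and that description becomes the instruction paired with the logged observations and actions for supervised training.

```python
# Minimal sketch of VLM-based hindsight relabelling.
# `env`, `policy`, and `vlm_describe` are assumed, illustrative interfaces:
# an environment with reset()/step(), an instruction-conditioned policy,
# and a pretrained vision-language model prompted to caption a frame.

from typing import Callable, List, Tuple

def collect_relabelled_episode(
    env,                                    # assumed: reset() -> obs, step(a) -> (obs, done)
    policy,                                 # assumed: act(obs, instruction) -> action
    vlm_describe: Callable[[object], str],  # assumed: frame -> natural-language description
    instruction: str,
    max_steps: int = 100,
) -> List[Tuple[object, str, object]]:
    """Roll out the agent, then use the VLM to retroactively describe what it
    actually did; that description becomes the supervised training target."""
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy.act(obs, instruction)
        next_obs, done = env.step(action)
        trajectory.append((obs, action))
        obs = next_obs
        if done:
            break

    # Hindsight step: ask the VLM what behaviour the episode actually shows,
    # e.g. "the agent lifted the red plane". The prompt given to the VLM
    # controls which aspects of the behaviour the description mentions.
    hindsight_instruction = vlm_describe(obs)

    # Pair every logged (observation, action) with the hindsight instruction,
    # yielding behavioural-cloning data for instruction-conditioned training.
    return [(o, hindsight_instruction, a) for o, a in trajectory]
```

Changing the prompt passed to the VLM (object names, colour words, or a few in-context category examples) changes which features the relabelled instructions refer to, which is how the paper controls whether the agent learns names, attributes, or abstract category membership.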
