Grounding Language with Visual Affordances over Unstructured Data

Recent works have shown that Large Language Models (LLMs) can be applied to ground natural language to a wide variety of robot skills. However, in practice, learning multi-task, language-conditioned robotic skills typically requires large-scale data collection and frequent human intervention to reset the environment or correct the current policies. In this work, we propose a novel approach to efficiently learn general-purpose language-conditioned robot skills from unstructured, offline, and reset-free data in the real world by exploiting a self-supervised visuo-lingual affordance model, which requires annotating as little as 1% of the total data with language. We evaluate our method in extensive experiments on both simulated and real-world robotic tasks, achieving state-of-the-art performance on the challenging CALVIN benchmark and learning over 25 distinct visuomotor manipulation tasks with a single policy in the real world. We find that when paired with LLMs that break down abstract natural language instructions into subgoals via few-shot prompting, our method is capable of completing long-horizon, multi-tier tasks in the real world, while requiring an order of magnitude less data than previous approaches. Code and videos are available at http://hulc2.cs.uni-freiburg.de.
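To make the LLM-based decomposition step concrete, below is a minimal sketch (not the authors' implementation) of how few-shot prompting could turn an abstract instruction into a list of language subgoals that a language-conditioned visuomotor policy then executes one at a time. The `complete` callable, the prompt wording, and the example tasks are illustrative assumptions, not code or prompts from the paper.

```python
# Minimal sketch: decompose an abstract instruction into robot subgoals via
# few-shot prompting. `complete` stands in for any LLM text-completion backend
# (a hypothetical placeholder, not part of the paper's released code).
from typing import Callable, List

FEW_SHOT_PROMPT = """\
Break the task into short, executable robot subgoals.

Task: tidy up the workspace and turn off all the lights
Subgoals:
1. pick up the block and place it in the drawer
2. push the sliding door to the left
3. turn off the led light
4. turn off the lightbulb

Task: {task}
Subgoals:
"""

def decompose(task: str, complete: Callable[[str], str]) -> List[str]:
    """Query the LLM with the few-shot prompt and parse the numbered subgoals."""
    response = complete(FEW_SHOT_PROMPT.format(task=task))
    subgoals = []
    for line in response.splitlines():
        line = line.strip()
        if line and line[0].isdigit():            # keep only "1. ..." style lines
            subgoals.append(line.split(".", 1)[1].strip())
    return subgoals

# Each parsed subgoal would then condition the policy in sequence, e.g.:
#   for goal in decompose("clean the table", llm_complete):
#       policy.rollout(language_goal=goal)
```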
