Grounding Language with Visual Affordances over Unstructured Data

Recent works have shown that Large Language Models (LLMs) can be applied to ground natural language to a wide variety of robot skills. However, in practice, learning multi-task, language-conditioned robotic skills typically requires large-scale data collection and frequent human intervention to reset the environment or correct the current policies. In this work, we propose a novel approach to efficiently learn general-purpose language-conditioned robot skills from unstructured, offline, and reset-free data in the real world by exploiting a self-supervised visuo-lingual affordance model, which requires annotating as little as 1% of the total data with language. We evaluate our method in extensive experiments on both simulated and real-world robotic tasks, achieving state-of-the-art performance on the challenging CALVIN benchmark and learning over 25 distinct visuomotor manipulation tasks with a single policy in the real world. We find that when paired with LLMs that break down abstract natural language instructions into subgoals via few-shot prompting, our method is capable of completing long-horizon, multi-tier tasks in the real world, while requiring an order of magnitude less data than previous approaches. Code and videos are available at http://hulc2.cs.uni-freiburg.de.
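To make the LLM-based decomposition step concrete, below is a minimal sketch (not the authors' implementation) of how few-shot prompting could turn an abstract instruction into a list of language subgoals that a language-conditioned visuomotor policy then executes one at a time. The `complete` callable, the prompt wording, and the example tasks are illustrative assumptions, not code or prompts from the paper.

```python
# Minimal sketch: decompose an abstract instruction into robot subgoals via
# few-shot prompting. `complete` stands in for any LLM text-completion backend
# (a hypothetical placeholder, not part of the paper's released code).
from typing import Callable, List

FEW_SHOT_PROMPT = """\
Break the task into short, executable robot subgoals.

Task: tidy up the workspace and turn off all the lights
Subgoals:
1. pick up the block and place it in the drawer
2. push the sliding door to the left
3. turn off the led light
4. turn off the lightbulb

Task: {task}
Subgoals:
"""

def decompose(task: str, complete: Callable[[str], str]) -> List[str]:
    """Query the LLM with the few-shot prompt and parse the numbered subgoals."""
    response = complete(FEW_SHOT_PROMPT.format(task=task))
    subgoals = []
    for line in response.splitlines():
        line = line.strip()
        if line and line[0].isdigit():            # keep only "1. ..." style lines
            subgoals.append(line.split(".", 1)[1].strip())
    return subgoals

# Each parsed subgoal would then condition the policy in sequence, e.g.:
#   for goal in decompose("clean the table", llm_complete):
#       policy.rollout(language_goal=goal)
```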
