VIMA: Robot Manipulation with Multimodal Prompts

Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. These are often treated as distinct tasks and tackled by specialized models. We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts, interleaving textual and visual tokens. Accordingly, we develop a new simulation benchmark that consists of thousands of procedurally generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and a four-level evaluation protocol for systematic generalization. We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. VIMA features a recipe that achieves strong model scalability and data efficiency. It outperforms alternative designs in the hardest zero-shot generalization setting by up to 2.9× task success rate given the same training data. With 10× less training data, VIMA still performs 2.7× better than the best competing variant. Code and video demos are available at vimalabs.github.io.
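To make the idea of conditioning a policy on interleaved textual and visual tokens concrete, here is a minimal sketch (not VIMA's actual implementation; all module names, feature dimensions, and the action parameterization are illustrative assumptions) of a transformer policy that cross-attends to a multimodal prompt and predicts the next motor action from the observation history:

```python
# Hypothetical sketch of a multimodal-prompt-conditioned policy.
# Names, shapes, and the action head are assumptions for illustration only.
import torch
import torch.nn as nn

class MultimodalPromptPolicy(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, n_action_dims=7):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)   # word tokens in the prompt
        self.image_embed = nn.Linear(512, d_model)             # visual tokens in the prompt
        self.obs_embed = nn.Linear(512, d_model)               # observation tokens at run time
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.action_head = nn.Linear(d_model, n_action_dims)   # motor-action parameters

    def forward(self, text_ids, image_feats, obs_feats):
        # Interleaving is simplified here to concatenation of text and image tokens
        # into a single prompt sequence that the decoder cross-attends to.
        prompt = torch.cat([self.text_embed(text_ids),
                            self.image_embed(image_feats)], dim=1)
        # The observation history is the target sequence; the last hidden state
        # autoregressively parameterizes the action for the current step.
        hidden = self.decoder(tgt=self.obs_embed(obs_feats), memory=prompt)
        return self.action_head(hidden[:, -1])

# Usage: one episode with a 5-token text prompt, 3 visual prompt tokens,
# and 4 past observation tokens (all features are random placeholders).
policy = MultimodalPromptPolicy()
action = policy(torch.randint(0, 1000, (1, 5)),
                torch.randn(1, 3, 512),
                torch.randn(1, 4, 512))
```

In this toy version the prompt is re-encoded jointly with the observations at every step; a real agent would cache the prompt encoding and decode actions causally over time.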
