Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation

Abstract: Transformers have revolutionized vision and natural language processing with their ability to scale with large datasets. But in robotic manipulation, data is both limited and expensive. Can we still benefit from Transformers with the right problem formulation? We investigate this question with PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation. PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer [21, 50], and outputs discretized actions by “detecting the next best voxel action”. Unlike frameworks that operate on 2D images, the voxelized observation and action space provides a strong structural prior for efficiently learning 6-DoF policies. With this formulation, we train a single multi-task Transformer for 18 RLBench tasks (with 249 variations) and 7 real-world tasks (with 18 variations) from just a few demonstrations per task. Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.

Table caption (ablation analysis): Success rates (mean %) of PerAct agents trained with 100 demonstrations per task, varying three factors that affect PerAct's performance: rotation augmentation, number of Perceiver latents, and voxel resolution.
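To make the “next best voxel action” idea concrete, here is a minimal sketch (not the authors' code) of decoding a discretized 6-DoF action from per-voxel and per-bin scores. The 100^3 voxel grid and 5-degree rotation bins follow the paper's reported setup; the function and variable names are illustrative assumptions.

```python
import numpy as np

# Illustrative constants: PerAct voxelizes the workspace into a 100^3 grid
# and discretizes each Euler rotation axis into 5-degree bins (72 per axis).
VOXELS = 100
ROT_BINS = 72

def decode_next_best_voxel_action(trans_q, rot_q, grip_q):
    """Decode a discretized 6-DoF action from score maps (hypothetical helper).

    trans_q: (VOXELS, VOXELS, VOXELS) score for placing the gripper at each voxel
    rot_q:   (3, ROT_BINS) score for each rotation bin on each Euler axis
    grip_q:  (2,) scores for gripper [closed, open]
    """
    # Translation: the "next best voxel" is the argmax over the whole grid.
    x, y, z = np.unravel_index(trans_q.argmax(), trans_q.shape)

    # Rotation: independent argmax per axis, converted back to degrees.
    euler_deg = rot_q.argmax(axis=-1) * (360.0 / ROT_BINS)

    # Gripper open/close reduces to a binary classification.
    gripper_open = bool(grip_q.argmax())

    return (x, y, z), euler_deg, gripper_open

# Example with random scores standing in for the Perceiver's outputs:
rng = np.random.default_rng(0)
voxel_xyz, euler, grip = decode_next_best_voxel_action(
    rng.standard_normal((VOXELS, VOXELS, VOXELS)),
    rng.standard_normal((3, ROT_BINS)),
    rng.standard_normal(2),
)
print(voxel_xyz, euler, grip)  # (x, y, z) indices, per-axis degrees, bool
```

In the paper, the pose corresponding to the chosen voxel and rotation bins is then reached with a motion planner, which is how “detection” is turned into control.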

[1] P. Abbeel, et al. On the Effectiveness of Fine-tuning Versus Meta-reinforcement Learning, 2022, arXiv:2206.03271.

[2] Ian S. Fischer, et al. Multi-Game Decision Transformers, 2022, NeurIPS.

[3] Trevor Darrell, et al. Voxel-informed Language Grounding, 2022, ACL.

[4] Sergio Gomez Colmenarejo, et al. A Generalist Agent, 2022, Trans. Mach. Learn. Res.

[5] Oier Mees, et al. What Matters in Language Conditioned Robotic Imitation Learning Over Unstructured Data, 2022, IEEE Robotics and Automation Letters.

[6] S. Levine, et al. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, 2022, CoRL.

[7] P. Abbeel, et al. Coarse-to-Fine Q-attention with Learned Path Ranking, 2022, arXiv.

[8] Adrian S. Wong, et al. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, 2022, ICLR.

[9] Vikash Kumar, et al. R3M: A Universal Visual Representation for Robot Manipulation, 2022, arXiv.

[10] Li Fei-Fei, et al. MetaMorph: Learning Universal Controllers with Transformers, 2022, ICLR.

[11] Lei Zhang, et al. Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds, 2022, CVPR.

[12] Stephen James, et al. Auto-Lambda: Disentangling Dynamic Task Relationships, 2022, Trans. Mach. Learn. Res.

[13] P. Abbeel, et al. Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents, 2022, ICML.

[14] T. Müller, et al. Instant neural graphics primitives with a multiresolution hash encoding, 2022, ACM Trans. Graph.

[15] Benjamin Recht, et al. Plenoxels: Radiance Fields without Neural Networks, 2021, CVPR.

[16] Vincent Sitzmann, et al. Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation, 2021, ICRA.

[17] W. Burgard, et al. CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks, 2021, IEEE Robotics and Automation Letters.

[18] Dieter Fox, et al. StructFormer: Learning Spatial Structure for Language-Guided Semantic Rearrangement of Novel Objects, 2021, ICRA.

[19] G. Konidaris, et al. Towards Optimal Correlational Object Search, 2021, ICRA.

[20] David J. Fleet, et al. Pix2seq: A Language Modeling Framework for Object Detection, 2021, ICLR.

[21] Olivier J. Hénaff, et al. Perceiver IO: A General Architecture for Structured Inputs & Outputs, 2021, ICLR.

[22] Xiaolong Wang, et al. Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers, 2021, ICLR.

[23] Stephen James, et al. Coarse-to-Fine Q-attention: Efficient Learning for Visual Robotic Manipulation via Discretisation, 2021, CVPR.

[24] Stephen James, et al. Q-attention: Enabling Efficient Learning for Vision-based Robotic Manipulation, 2021, IEEE Robotics and Automation Letters.

[25] Nan Rosemary Ke, et al. Coordination Among Neural Modules Through a Shared Global Workspace, 2021, ICLR.

[26] Yi Tay, et al. Efficient Transformers: A Survey, 2020, ACM Comput. Surv.

[27] Zhanpeng He, et al. Universal Manipulation Policy Network for Articulated Objects, 2021, IEEE Robotics and Automation Letters.

[28] Sergey Levine, et al. BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning, 2022, CoRL.

[29] Learning Generalizable Vision-Tactile Robotic Grasping Strategy for Deformable Objects via Transformer, 2021, arXiv.

[30] Maya Cakmak, et al. Assistive Tele-op: Leveraging Transformers to Collect Robotic Task Demonstrations, 2021, arXiv.

[31] Jitendra Malik, et al. Differentiable Spatial Planning using Transformers, 2021, ICML.

[32] Steven K. Feiner, et al. Scene Editing as Teleoperation: A Case Study in 6DoF Kit Assembly, 2021, IROS.

[33] Vinay Uday Prabhu, et al. Multimodal datasets: misogyny, pornography, and malignant stereotypes, 2021, arXiv.

[34] Dieter Fox, et al. CLIPort: What and Where Pathways for Robotic Manipulation, 2021, CoRL.

[35] D. Fox, et al. SORNet: Spatial Object-Centric Representations for Sequential Manipulation, 2021, CoRL.

[36] Minzhe Niu, et al. Voxel Transformer for 3D Object Detection, 2021, ICCV.

[37] S. Savarese, et al. Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation, 2021, CoRL.

[38] Jonathan Tompson, et al. Implicit Behavioral Cloning, 2021, CoRL.

[39] Yasuo Kuniyoshi, et al. Transformer-based deep imitation learning for dual-arm robot manipulation, 2021, IROS.

[40] Oriol Vinyals, et al. Highly accurate protein structure prediction with AlphaFold, 2021, Nature.

[41] Dieter Fox, et al. A Persistent Spatial Semantic Representation for High-level Natural Language Instruction Execution, 2021, CoRL.

[42] Michael C. Yip, et al. Motion Planning Transformers: One Model to Plan Them All, 2021, arXiv.

[43] Sergey Levine, et al. Offline Reinforcement Learning as One Big Sequence Modeling Problem, 2021, NeurIPS.

[44] Pieter Abbeel, et al. Decision Transformer: Reinforcement Learning via Sequence Modeling, 2021, NeurIPS.

[45] Edward Johns, et al. Coarse-to-Fine Imitation Learning: Robot Manipulation from a Single Demonstration, 2021, ICRA.

[46] Yann LeCun, et al. MDETR: Modulated Detection for End-to-End Multi-Modal Understanding, 2021, ICCV.

[47] Mihir Prabhudesai, et al. CoCoNets: Continuous Contrastive 3D Scene Representations, 2021, CVPR.

[48] Patricio A. Vela, et al. A Joint Network for Grasp Detection Conditioned on Natural Language Commands, 2021, ICRA.

[49] Vladlen Koltun, et al. Vision Transformers for Dense Prediction, 2021, ICCV.

[50] Andrew Zisserman, et al. Perceiver: General Perception with Iterative Attention, 2021, ICML.

[51] Emily M. Bender, et al. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜, 2021, FAccT.

[52] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.

[53] Joelle Pineau, et al. Multi-Task Reinforcement Learning with Context-based Representations, 2021, ICML.

[54] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.

[55] Matthew J. Hausknecht, et al. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning, 2020, ICLR.

[56] Gregory D. Hager, et al. Guiding Multi-Step Rearrangement Tasks with Natural Language Instructions, 2021, CoRL.

[57] Stephen Lin, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, 2021, ICCV.

[58] Mihir Prabhudesai, et al. 3D-OES: Viewpoint-Invariant Object-Factorized Environment Simulators, 2020, CoRL.

[59] Sudeep Dasari, et al. Transformers for One-Shot Visual Imitation, 2020, CoRL.

[60] Peter R. Florence, et al. Transporter Networks: Rearranging the Visual World for Robotic Manipulation, 2020, CoRL.

[61] Leslie Pack Kaelbling, et al. Integrated Task and Motion Planning, 2020, Annu. Rev. Control. Robotics Auton. Syst.

[62] Thomas Kipf, et al. Object-Centric Learning with Slot Attention, 2020, NeurIPS.

[63] Stefanie Tellex, et al. Robot Object Retrieval with Contextual Natural Language Queries, 2020, Robotics: Science and Systems.

[64] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[65] Nicolas Usunier, et al. End-to-End Object Detection with Transformers, 2020, ECCV.

[66] Pierre Sermanet, et al. Grounding Language in Play, 2020, arXiv.

[67] Jacob Andreas, et al. Experience Grounds Language, 2020, EMNLP.

[68] Andy Zeng, et al. Grasping in the Wild: Learning 6DoF Closed-Loop Grasping From Low-Cost Demonstrations, 2019, IEEE Robotics and Automation Letters.

[69] Dieter Fox, et al. 6-DOF Grasping for Target-driven Object Manipulation in Clutter, 2019, ICRA.

[70] P. Abbeel, et al. Learning to Manipulate Deformable Objects without Demonstrations, 2019, Robotics: Science and Systems.

[71] Andrew J. Davison, et al. RLBench: The Robot Learning Benchmark & Learning Environment, 2019, IEEE Robotics and Automation Letters.

[72] D. Fox, et al. Self-supervised 6D Object Pose Estimation for Robot Manipulation, 2019, ICRA.

[73] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, 2019, ICLR.

[74] Alberto Rodriguez, et al. TossingBot: Learning to Throw Arbitrary Objects With Residual Physics, 2019, IEEE Transactions on Robotics.

[75] Ross B. Girshick, et al. Mask R-CNN, 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[76] S. Levine, et al. Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning, 2019, CoRL.

[77] D. Fox, et al. The Best of Both Modes: Separately Leveraging RGB and Depth for Unseen Object Instance Segmentation, 2019, CoRL.

[78] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, arXiv.

[79] Andrew J. Davison, et al. PyRep: Bringing V-REP to Deep Robot Learning, 2019, arXiv.

[80] Dieter Fox, et al. 6-DOF GraspNet: Variational Grasp Generation for Object Manipulation, 2019, ICCV.

[81] Dieter Fox, et al. Prospection: Interpretable plans from language by predicting the future, 2019, ICRA.

[82] Gordon Wetzstein, et al. DeepVoxels: Learning Persistent 3D Feature Embeddings, 2018, CVPR.

[83] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[84] Andrew Bennett, et al. Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction, 2018, EMNLP.

[85] Sergey Levine, et al. QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation, 2018, CoRL.

[86] Mohit Shridhar, et al. Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction, 2018, Robotics: Science and Systems.

[87] Leslie Pack Kaelbling, et al. From Skills to Symbols: Learning Symbolic Representations for Abstract High-Level Planning, 2018, J. Artif. Intell. Res.

[88] Dieter Fox, et al. PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes, 2017, Robotics: Science and Systems.

[89] Kuniyuki Takahashi, et al. Interactively Picking Real-World Objects with Unconstrained Spoken Language Instructions, 2017, ICRA.

[90] Ian Taylor, et al. Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching, 2017, ICRA.

[91] Aaron C. Courville, et al. FiLM: Visual Reasoning with a General Conditioning Layer, 2017, AAAI.

[92] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[93] Sergey Levine, et al. Deep visual foresight for planning robot motion, 2016, ICRA.

[94] Kuan-Ting Yu, et al. Multi-view self-supervised deep learning for 6D pose estimation in the Amazon Picking Challenge, 2016, ICRA.

[95] Daniel Marcu, et al. Natural Language Communication with Robots, 2016, NAACL.

[96] Sergey Levine, et al. End-to-End Training of Deep Visuomotor Policies, 2015, J. Mach. Learn. Res.

[97] Kevin Lee, et al. Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions, 2014, Int. J. Robotics Res.

[98] Peter Stone, et al. Learning to Interpret Natural Language Commands through Human-Robot Dialog, 2015, IJCAI.

[99] Thomas Brox, et al. U-Net: Convolutional Networks for Biomedical Image Segmentation, 2015, MICCAI.

[100] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[101] James J. Gibson. The Ecological Approach to Visual Perception: Classic Edition, 2014.

[102] Kostas Daniilidis, et al. Single image 3D object detection and pose estimation for grasping, 2014, ICRA.

[103] Luke S. Zettlemoyer, et al. Learning from Unscripted Deictic Gesture and Language for Human-Robot Interactions, 2014, AAAI.

[104] Matthias Nießner, et al. Real-time 3D reconstruction at scale using voxel hashing, 2013, ACM Trans. Graph.

[105] Leslie Pack Kaelbling, et al. Integrated task and motion planning in belief space, 2013, Int. J. Robotics Res.

[106] Surya P. N. Singh, et al. V-REP: A versatile and scalable robot simulation framework, 2013, IROS.

[107] Stefanie Tellex, et al. Interpreting and Executing Recipes with a Cooking Robot, 2012, ISER.

[108] Matthew R. Walter, et al. Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation, 2011, AAAI.

[109] Fu Jie Huang, et al. A Tutorial on Energy-Based Learning, 2006.

[110] Hans P. Moravec. Robot spatial perception by stereoscopic vision and 3D evidence grids, 1996.

[111] R. A. Brooks. New Approaches to Robotics, 1991, Science.

[112] Ramesh C. Jain, et al. Building an environment model using depth information, 1989, Computer.