Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations

Autonomous robotic systems capable of learning novel manipulation tasks are poised to transform industries from manufacturing to service automation. However, modern methods (e.g., VIP and R3M) still face significant hurdles, notably the domain gap among robotic embodiments and the sparsity of successful task executions within specific action spaces, which result in misaligned and ambiguous task representations. We introduce Ag2Manip (Agent-Agnostic representations for Manipulation), a framework that addresses these challenges through two key innovations: (1) a novel agent-agnostic visual representation derived from human manipulation videos, with the specifics of embodiments obscured to enhance generalizability; and (2) an agent-agnostic action representation that abstracts a robot's kinematics to a universal agent proxy, emphasizing the crucial interactions between end-effector and object. Empirical validation across simulated benchmarks such as FrankaKitchen, ManiSkill, and PartManip shows a 325% increase in performance, achieved without domain-specific demonstrations. Ablation studies underline the essential contributions of both the visual and action representations to this success. In real-world evaluations, Ag2Manip significantly improves imitation learning success rates from 50% to 77.5%, demonstrating its effectiveness and generalizability across both simulated and physical environments.
