PoCo: Policy Composition from and for Heterogeneous Robot Learning

Training general robotic policies from heterogeneous data for different tasks is a significant challenge. Existing robotic datasets vary in modality, such as color, depth, tactile, and proprioceptive information, and are collected in different domains, such as simulation, real robots, and human videos. Current methods usually collect and pool all data from one domain to train a single policy to handle such heterogeneity in tasks and domains, which is prohibitively expensive and difficult. In this work, we present a flexible approach, dubbed Policy Composition, to combine information across such diverse modalities and domains for learning scene-level and task-level generalized manipulation skills, by composing different data distributions represented with diffusion models. Our method can use task-level composition for multi-task manipulation and be composed with analytic cost functions to adapt policy behaviors at inference time. We train our method on simulation, human, and real robot data and evaluate it on tool-use tasks. The composed policy achieves robust and dexterous performance under varying scenes and tasks and outperforms baselines trained on a single data source in both simulation and real-world experiments. See https://liruiw.github.io/policycomp for more details.
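
As a rough illustration of what composing diffusion policies at inference time can look like, the sketch below takes a weighted sum of the noise predictions from several policies and then adds the gradient of an analytic cost. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the weights, the guidance scale, and the trajectory shape `(horizon, action_dim)` are all hypothetical, and the noise predictions are plain arrays standing in for the outputs of trained models.

```python
import numpy as np


def compose_noise_predictions(eps_list, weights):
    """Weighted sum of per-policy noise predictions (hypothetical helper).

    Summing weighted noise predictions corresponds to sampling from a
    product of the underlying distributions, each raised to its weight.
    """
    eps = np.stack(eps_list)                 # (n_policies, horizon, action_dim)
    w = np.asarray(weights, dtype=float)     # (n_policies,)
    return np.tensordot(w, eps, axes=1)      # (horizon, action_dim)


def add_cost_guidance(eps, cost_grad, scale):
    """Steer denoising toward lower cost (hypothetical helper).

    Treating exp(-cost) as an extra factor in the product, its score is
    -cost_grad, which enters the noise prediction with a positive sign.
    """
    return eps + scale * cost_grad


# Stand-ins for the outputs of, e.g., a simulation-trained and a
# real-robot-trained policy at one denoising step (assumed shapes).
eps_sim = np.ones((4, 2))
eps_real = np.zeros((4, 2))
eps_composed = compose_noise_predictions([eps_sim, eps_real], [0.5, 0.5])

# Gradient of an assumed analytic cost with respect to the noisy trajectory.
cost_grad = np.full((4, 2), 2.0)
eps_guided = add_cost_guidance(eps_composed, cost_grad, scale=0.1)
```

In a full sampler, the composed prediction would replace the single-model noise estimate inside each denoising step; the weights and guidance scale would need tuning per task.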
