K-VIL: Keypoints-based Visual Imitation Learning

Visual imitation learning provides efficient and intuitive solutions for robotic systems to acquire novel manipulation skills. However, simultaneously learning geometric task constraints and control policies from visual inputs alone remains a challenging problem. In this paper, we propose an approach for keypoint-based visual imitation learning (K-VIL) that automatically extracts sparse, object-centric, and embodiment-independent task representations from a small number of human demonstration videos. The task representation is composed of keypoint-based geometric constraints on principal manifolds, their associated local frames, and the movement primitives required for task execution. Our approach extracts such task representations from a single demonstration video and incrementally updates them as new demonstrations become available. To reproduce manipulation skills in novel scenes using the learned set of prioritized geometric constraints, we introduce a keypoint-based admittance controller. We evaluate our approach in several real-world applications, showcasing its ability to deal with cluttered scenes, viewpoint mismatches, new instances of categorical objects, and large variations in object pose and shape, as well as its efficiency and robustness in both one-shot and few-shot imitation learning settings. Videos and source code are available at https://sites.google.com/view/k-vil.
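As a concrete illustration of the keypoint-based admittance idea, the sketch below applies a standard virtual mass-spring-damper admittance law in keypoint space: each keypoint is pulled toward a target that satisfies its geometric constraint, while a sensed external force can deflect the motion compliantly. This is a minimal sketch under stated assumptions, not the controller from the paper; the function name is hypothetical, and encoding constraint priority as per-constraint stiffness scaling is one plausible realization, not necessarily K-VIL's.

```python
import numpy as np

def keypoint_admittance_step(x, xd, x_des, f_ext, M, D, K, dt):
    """One semi-implicit Euler step of a virtual mass-spring-damper
    admittance law in keypoint space (illustrative sketch).

    x, xd   : (3,) current keypoint position and velocity
    x_des   : (3,) constraint target, e.g. the keypoint's projection
              onto its principal-manifold constraint
    f_ext   : (3,) external force mapped to the keypoint
    M, D, K : (3, 3) virtual inertia, damping, and stiffness
    """
    # Spring-damper term drives the keypoint toward the constraint target;
    # the external force lets the motion yield compliantly on contact.
    xdd = np.linalg.solve(M, f_ext - D @ xd - K @ (x - x_des))
    xd_next = xd + dt * xdd
    return x + dt * xd_next, xd_next

if __name__ == "__main__":
    M = np.eye(3)          # virtual inertia
    D = 8.0 * np.eye(3)    # virtual damping
    K = 40.0 * np.eye(3)   # virtual stiffness; scaling K per constraint is one
                           # plausible way to reflect constraint priorities
    x, xd = np.array([0.30, 0.00, 0.20]), np.zeros(3)
    x_des = np.array([0.30, 0.10, 0.25])
    for _ in range(300):
        x, xd = keypoint_admittance_step(x, xd, x_des, np.zeros(3), M, D, K, dt=0.01)
    print(x)  # converges toward x_des in the absence of external forces
```

With zero external force the keypoint settles on its constraint target; a nonzero `f_ext` deflects it in proportion to the virtual compliance, which is the behavior an admittance controller is meant to provide during contact-rich execution.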
