VAT-Mart: Learning Visual Action Trajectory Proposals for Manipulating 3D ARTiculated Objects

Perceiving and manipulating 3D articulated objects (e.g., cabinets, doors) in human environments is an important yet challenging task for future home-assistant robots. The space of 3D articulated objects is exceptionally rich in its myriad semantic categories, diverse shape geometry, and complicated part functionality. Previous works mostly abstract kinematic structure with estimated joint parameters and part poses as the visual representations for manipulating 3D articulated objects. In this paper, we propose object-centric actionable visual priors as a novel perception-interaction handshaking point, in which the perception system outputs more actionable guidance than kinematic structure estimation by predicting dense geometry-aware, interaction-aware, and task-aware visual action affordance and trajectory proposals. We design an interaction-for-perception framework, VAT-Mart, to learn such actionable visual representations by simultaneously training a curiosity-driven reinforcement learning policy that explores diverse interaction trajectories and a perception module that summarizes and generalizes the explored knowledge for pointwise predictions among diverse shapes. Experiments demonstrate the effectiveness of the proposed approach using the large-scale PartNet-Mobility dataset in the SAPIEN environment and show promising generalization capabilities to novel test shapes, unseen object categories, and real-world data. Please check the project webpage for code, data, video, and more materials.
