Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation

We present Neural Descriptor Fields (NDFs), an object representation that encodes both points and relative poses between an object and a target (such as a robot gripper or a rack used for hanging) via category-level descriptors. We employ this representation for object manipulation, where given a task demonstration, we want to repeat the same task on a new object instance from the same category. We propose to achieve this objective by searching (via optimization) for the pose whose descriptor matches that observed in the demonstration. NDFs are conveniently trained in a self-supervised fashion via a 3D auto-encoding task that does not rely on expert-labeled keypoints. Further, NDFs are SE(3)-equivariant, guaranteeing performance that generalizes across all possible 3D object translations and rotations. We demonstrate learning of manipulation tasks from few (∼5-10) demonstrations both in simulation and on a real robot. Our performance generalizes across both object instances and 6-DoF object poses, and significantly outperforms a recent baseline that relies on 2D descriptors. Project website: https://yilundu.github.io/ndf/

[1]  Gordon Wetzstein,et al.  Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations , 2019, NeurIPS.

[2]  Avinash C. Kak,et al.  Real-time tracking and pose estimation for industrial objects using geometric features , 2003, 2003 IEEE International Conference on Robotics and Automation (Cat. No.03CH37422).

[3]  Wei Gao,et al.  kPAM: KeyPoint Affordances for Category-Level Robotic Manipulation , 2019, ISRR.

[4]  Gordon Wetzstein,et al.  MetaSDF: Meta-learning Signed Distance Functions , 2020, NeurIPS.

[5]  Vincent Sitzmann,et al.  Deep Medial Fields , 2021, ArXiv.

[6]  Li Fei-Fei,et al.  ObjectFolder: A Dataset of Objects with Implicit Visual, Auditory, and Tactile Representations , 2021, CoRL.

[7]  Akira Nakamura,et al.  Probabilistic approach for object bin picking approximated by cylinders , 2013, 2013 IEEE International Conference on Robotics and Automation.

[8]  Changil Kim,et al.  Space-time Neural Irradiance Fields for Free-Viewpoint Video , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Vincent Sitzmann,et al.  Light Field Networks: Neural Scene Representations with Single-Evaluation Rendering , 2021, NeurIPS.

[10]  Russ Tedrake,et al.  Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation , 2018, CoRL.

[11]  Sebastian Nowozin,et al.  Occupancy Networks: Learning 3D Reconstruction in Function Space , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Pieter Abbeel,et al.  Learning from Demonstrations Through the Use of Non-rigid Registration , 2013, ISRR.

[13]  Peter R. Florence,et al.  Transporter Networks: Rearranging the Visual World for Robotic Manipulation , 2020, CoRL.

[14]  Xin Tong,et al.  Deformed Implicit Field: Modeling 3D Shapes with Learned Dense Correspondence , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Marc Pollefeys,et al.  Convolutional Occupancy Networks , 2020, ECCV.

[16]  Robert Platt,et al.  Pick and Place Without Geometric Object Models , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[17]  Hao Li,et al.  PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[18]  Andreas Geiger,et al.  Occupancy Flow: 4D Reconstruction by Learning Particle Dynamics , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[19]  Henrik I. Christensen,et al.  Automatic grasp planning using shape primitives , 2003, 2003 IEEE International Conference on Robotics and Automation (Cat. No.03CH37422).

[20]  Matthew Tancik,et al.  pixelNeRF: Neural Radiance Fields from One or Few Images , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Andrea Tagliasacchi,et al.  Vector Neurons: A General Framework for SO(3)-Equivariant Networks , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Jonathan T. Barron,et al.  Deformable Neural Radiance Fields , 2020, ArXiv.

[23]  Noah Snavely,et al.  Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Richard A. Newcombe,et al.  DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Pratul P. Srinivasan,et al.  NeRF , 2020, ECCV.

[26]  Francesc Moreno-Noguer,et al.  D-NeRF: Neural Radiance Fields for Dynamic Scenes , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Stefan Schaal,et al.  Is imitation learning the route to humanoid robots? , 1999, Trends in Cognitive Sciences.

[28]  Torsten Kröger,et al.  Self-Supervised Learning for Precise Pick-and-Place Without Object Model , 2020, IEEE Robotics and Automation Letters.

[29]  Ronen Basri,et al.  Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance , 2020, NeurIPS.

[30]  Jonathan T. Barron,et al.  iNeRF: Inverting Neural Radiance Fields for Pose Estimation , 2020, 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[31]  Hao Zhang,et al.  Learning Implicit Fields for Generative Shape Modeling , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[33]  Xiaolong Wang,et al.  Learning Continuous Image Representation with Local Implicit Image Function , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Thomas Funkhouser,et al.  Grasping in the Wild: Learning 6DoF Closed-Loop Grasping From Low-Cost Demonstrations , 2020, IEEE Robotics and Automation Letters.

[35]  Leonidas J. Guibas,et al.  PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space , 2017, NIPS.

[36]  Kostas Daniilidis,et al.  Single image 3D object detection and pose estimation for grasping , 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[37]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.

[38]  Leslie Pack Kaelbling,et al.  Shape-Based Transfer of Generic Skills , 2021, 2021 IEEE International Conference on Robotics and Automation (ICRA).

[39]  Andreas Geiger,et al.  Differentiable Volumetric Rendering: Learning Implicit 3D Representations Without 3D Supervision , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Priya Sundaresan,et al.  Learning Rope Manipulation Policies Using Dense Object Descriptors Trained on Synthetic Depth Data , 2020, 2020 IEEE International Conference on Robotics and Automation (ICRA).

[41]  Jiajun Wu,et al.  Neural Radiance Flow for 4D View Synthesis and Video Processing , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[42]  Wei Gao,et al.  kPAM-SC: Generalizable Manipulation Planning using KeyPoint Affordance and Shape Completion , 2019, 2021 IEEE International Conference on Robotics and Automation (ICRA).

[43]  Wei Gao,et al.  kPAM 2.0: Feedback Control for Category-Level Robotic Manipulation , 2021, IEEE Robotics and Automation Letters.

[44]  Dean Pomerleau,et al.  ALVINN, an autonomous land vehicle in a neural network , 2015 .

[45]  Antonio Torralba,et al.  BARF: Bundle-Adjusting Neural Radiance Fields , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[46]  Brett Browning,et al.  A survey of robot learning from demonstration , 2009, Robotics Auton. Syst..

[47]  Gordon Wetzstein,et al.  Semantic Implicit Neural Scene Representations With Semi-Supervised Training , 2020, 2020 International Conference on 3D Vision (3DV).

[48]  Huei Peng,et al.  Correspondence-Free Point Cloud Registration with SO(3)-Equivariant Implicit Shape Representations , 2021, CoRL.