Neural Fields for Robotic Object Manipulation from a Single Image

We present a unified and compact representation for object rendering, 3D reconstruction, and grasp pose prediction that can be inferred from a single image within a few seconds. We achieve this by leveraging recent advances in the Neural Radiance Field (NeRF) literature that learn category-level priors and fine-tune on novel objects with minimal data and time. Our insight is that we can learn a compact shape representation and extract meaningful additional information from it, such as grasping poses. We believe this to be the first work to retrieve grasping poses directly from a NeRF-based representation using a single viewpoint (RGB-only), rather than going through a secondary network and/or representation. Compared to prior art, our method is two to three orders of magnitude smaller while achieving comparable performance on view reconstruction and grasping. To accompany our method, we also propose a new dataset of rendered shoes, annotated with grasping poses for grippers of different widths, for training a sim-2-real NeRF method.
