3D Neural Scene Representations for Visuomotor Control

Humans have a strong intuitive understanding of the 3D environment around us. The mental model of the physics in our brain applies to objects of different materials and enables us to perform a wide range of manipulation tasks that are far beyond the reach of current robots. In this work, we desire to learn models for dynamic 3D scenes purely from 2D visual observations. Our model combines Neural Radiance Fields (NeRF) and time contrastive learning with an autoencoding framework, which learns viewpoint-invariant 3D-aware scene representations. We show that a dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks involving both rigid bodies and fluids, where the target is specified in a viewpoint different from what the robot operates on. When coupled with an auto-decoding framework, it can even support goal specification from camera viewpoints that are outside the training distribution. We further demonstrate the richness of the learned 3D dynamics model by performing future prediction and novel view synthesis. Finally, we provide detailed ablation studies regarding different system designs and qualitative analysis of the learned representations.

[1]  Noah Snavely,et al.  Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Sergey Levine,et al.  Self-Supervised Visual Planning with Temporal Skip Connections , 2017, CoRL.

[3]  Demis Hassabis,et al.  Mastering Atari, Go, chess and shogi by planning with a learned model , 2019, Nature.

[4]  Sergey Levine,et al.  Unsupervised Learning for Physical Interaction through Video Prediction , 2016, NIPS.

[5]  Max Jaderberg,et al.  Unsupervised Learning of 3D Structure from Images , 2016, NIPS.

[6]  Pratul P. Srinivasan,et al.  NeRF , 2020, ECCV.

[7]  Matthew T. Mason,et al.  Pushing revisited: Differential flatness, trajectory planning, and stabilization , 2019, Int. J. Robotics Res..

[8]  Francesc Moreno-Noguer,et al.  D-NeRF: Neural Radiance Fields for Dynamic Scenes , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  P. Alam,et al.  R , 1823, The Herodotus Encyclopedia.

[10]  Silvio Savarese,et al.  6-PACK: Category-level 6D Pose Tracker with Anchor-Based Keypoints , 2019, 2020 IEEE International Conference on Robotics and Automation (ICRA).

[11]  Gordon Wetzstein,et al.  Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations , 2019, NeurIPS.

[12]  Russ Tedrake,et al.  Keypoints into the Future: Self-Supervised Correspondence in Model-Based Reinforcement Learning , 2020, CoRL.

[13]  Jiajun Wu,et al.  Learning Particle Dynamics for Manipulating Rigid Bodies, Deformable Objects, and Fluids , 2018, ICLR.

[14]  Martin A. Riedmiller,et al.  Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images , 2015, NIPS.

[15]  Gabriel J. Brostow,et al.  Interpretable Transformations with Encoder-Decoder Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[16]  Alexandre Bernardino,et al.  Beyond the Self: Using Grounded Affordances to Interpret and Describe Others’ Actions , 2019, IEEE Transactions on Cognitive and Developmental Systems.

[17]  Jiajun Wu,et al.  Visual Object Networks: Image Generation with Disentangled 3D Representations , 2018, NeurIPS.

[18]  Wei Gao,et al.  kPAM: KeyPoint Affordances for Category-Level Robotic Manipulation , 2019, ISRR.

[19]  Michael Goesele,et al.  Neural 3D Video Synthesis , 2021, ArXiv.

[20]  Changil Kim,et al.  Space-time Neural Irradiance Fields for Free-Viewpoint Video , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Alex Trevithick,et al.  GRF: Learning a General Radiance Field for 3D Scene Representation and Rendering , 2020, ArXiv.

[22]  Connor Schenck,et al.  Perceiving and reasoning about liquids using fully convolutional networks , 2017, Int. J. Robotics Res..

[23]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Katerina Fragkiadaki,et al.  Learning Spatial Common Sense With Geometry-Aware Recurrent Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Christian Theobalt,et al.  Non-Rigid Neural Radiance Fields: Reconstruction and Novel View Synthesis of a Dynamic Scene From Monocular Video , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[26]  Russ Tedrake,et al.  The Surprising Effectiveness of Linear Models for Visual Foresight in Object Pile Manipulation , 2020, WAFR.

[27]  Richard A. Newcombe,et al.  DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Mohammad Norouzi,et al.  Dream to Control: Learning Behaviors by Latent Imagination , 2019, ICLR.

[29]  Tae-Yong Kim,et al.  Unified particle physics for real-time applications , 2014, ACM Trans. Graph..

[30]  Yong-Liang Yang,et al.  RenderNet: A deep convolutional network for differentiable rendering from 3D shapes , 2018, NeurIPS.

[31]  Alberto Rodriguez,et al.  Feedback Control of the Pusher-Slider System: A Story of Hybrid and Underactuated Contact Dynamics , 2016, WAFR.

[32]  Gordon Wetzstein,et al.  MetaSDF: Meta-learning Signed Distance Functions , 2020, NeurIPS.

[33]  Yen-Chen Lin,et al.  Experience-Embedded Visual Foresight , 2019, CoRL.

[34]  Jonathan T. Barron,et al.  Deformable Neural Radiance Fields , 2020, ArXiv.

[35]  Gordon Wetzstein,et al.  DeepVoxels: Learning Persistent 3D Feature Embeddings , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Thomas Brox,et al.  Single-view to Multi-view: Reconstructing Unseen Views with a Convolutional Network , 2015, ArXiv.

[37]  Jiajun Wu,et al.  Propagation Networks for Model-Based Control Under Partial Observation , 2018, 2019 International Conference on Robotics and Automation (ICRA).

[38]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[39]  Edwin Olson,et al.  AprilTag: A robust and flexible visual fiducial system , 2011, 2011 IEEE International Conference on Robotics and Automation.

[40]  Koray Kavukcuoglu,et al.  Neural scene representation and rendering , 2018, Science.

[41]  Evangelos Theodorou,et al.  Model Predictive Path Integral Control using Covariance Variable Importance Sampling , 2015, ArXiv.

[42]  Connor Schenck,et al.  Reasoning About Liquids via Closed-Loop Simulation , 2017, Robotics: Science and Systems.

[43]  Wei Gao,et al.  kPAM 2.0: Feedback Control for Category-Level Robotic Manipulation , 2021, IEEE Robotics and Automation Letters.

[44]  Jitendra Malik,et al.  Learning to Poke by Poking: Experiential Learning of Intuitive Physics , 2016, NIPS.

[45]  Sergey Levine,et al.  Deep Dynamics Models for Learning Dexterous Manipulation , 2019, CoRL.

[46]  Daniel L. K. Yamins,et al.  Visual Grounding of Learned Physical Models , 2020, ICML.

[47]  Ankush Gupta,et al.  Unsupervised Learning of Object Keypoints for Perception and Control , 2019, NeurIPS.

[48]  Maria Bauzá,et al.  A Data-Efficient Approach to Precise and Controlled Pushing , 2018, CoRL.

[49]  Ruben Villegas,et al.  Learning Latent Dynamics for Planning from Pixels , 2018, ICML.

[50]  Sergey Levine,et al.  Time-Contrastive Networks: Self-Supervised Learning from Video , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[51]  Hao Li,et al.  PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[52]  Andreas Geiger,et al.  Occupancy Flow: 4D Reconstruction by Learning Particle Dynamics , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[53]  Matthew Tancik,et al.  pixelNeRF: Neural Radiance Fields from One or Few Images , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Gene F. Franklin,et al.  Feedback Control of Dynamic Systems , 1986 .

[55]  George Loizou,et al.  Computer vision and pattern recognition , 2007, Int. J. Comput. Math..

[56]  Dieter Fox,et al.  Causal Discovery in Physical Systems from Videos , 2020, NeurIPS.

[57]  Felix Heide,et al.  Neural Scene Graphs for Dynamic Scenes , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Katerina Fragkiadaki,et al.  3D-OES: Viewpoint-Invariant Object-Factorized Environment Simulators , 2020, CoRL.

[59]  Andreas Geiger,et al.  Differentiable Volumetric Rendering: Learning Implicit 3D Representations Without 3D Supervision , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Jiajun Wu,et al.  Neural Radiance Flow for 4D View Synthesis and Video Processing , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[61]  Sergey Levine,et al.  Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control , 2018, ArXiv.