Neural Scene Graphs for Dynamic Scenes

Recent implicit neural rendering methods have demonstrated that accurate view synthesis for complex scenes can be learned by predicting volumetric density and color, supervised solely by a set of RGB images. However, existing methods are restricted to learning efficient interpolations of static scenes and encode all scene objects into a single neural network; they can neither represent dynamic scenes nor decompose a scene into individual objects. In this work, we present the first neural rendering method that decomposes dynamic scenes into scene graphs. We propose a learned scene graph representation, which encodes object transformation and radiance, to efficiently render novel arrangements and views of the scene. To this end, we learn implicitly encoded scenes, combined with a jointly learned latent representation that describes objects with a single implicit function. We assess the proposed method on synthetic and real automotive data, validating that our approach learns dynamic scenes solely by observing a video of the scene, and that it allows rendering novel photo-realistic views of novel scene compositions with unseen sets of objects at unseen poses.
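To make the idea of a scene graph that "encodes object transformation and radiance" concrete, the following is a minimal sketch of such a data structure, not the authors' implementation. All names (`ObjectNode`, `SceneGraph`, `world_to_local`, `objects_at`) are hypothetical, the pose is assumed to be a translation plus a rotation about the vertical axis, and the latent code is only a placeholder for whatever would condition a shared radiance model.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class ObjectNode:
    """Leaf node for one dynamic object (hypothetical fields, not the paper's API)."""
    translation: Vec3      # object origin in world coordinates
    yaw: float             # rotation about the vertical (y) axis, in radians
    size: Vec3             # full extents of the object's bounding box
    latent: List[float]    # placeholder latent code for a shared radiance model

    def world_to_local(self, p: Vec3) -> Vec3:
        """Map a world-space point into this object's local frame."""
        dx = p[0] - self.translation[0]
        dy = p[1] - self.translation[1]
        dz = p[2] - self.translation[2]
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        # Apply the inverse rotation R_y(-yaw) to the translated point.
        return (c * dx - s * dz, dy, s * dx + c * dz)

    def contains(self, p: Vec3) -> bool:
        """True if the world-space point lies inside the object's box."""
        local = self.world_to_local(p)
        return all(abs(v) <= e / 2 for v, e in zip(local, self.size))


@dataclass
class SceneGraph:
    """Root node holding the object leaves (a background model is omitted here)."""
    objects: List[ObjectNode] = field(default_factory=list)

    def objects_at(self, p: Vec3) -> List[ObjectNode]:
        """Objects whose local model would be queried at point p."""
        return [o for o in self.objects if o.contains(p)]


# Assumed example values: one box-shaped object rotated by 90 degrees.
car = ObjectNode(translation=(2.0, 0.0, 0.0), yaw=math.pi / 2,
                 size=(4.0, 2.0, 2.0), latent=[0.0] * 8)
scene = SceneGraph(objects=[car])
```

In this sketch, a novel arrangement amounts to editing a node's `translation` or `yaw` while its `latent` stays fixed, so the same object description can be queried at a pose that was never observed.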
