Neural Scene Graphs for Dynamic Scenes

Recent implicit neural rendering methods have demonstrated that accurate view synthesis for complex scenes can be learned by predicting volumetric density and color, supervised solely by a set of RGB images. However, existing methods are restricted to static scenes: they encode all scene objects into a single neural network, so they can neither represent dynamic scenes nor decompose a scene into individual objects. In this work, we present the first neural rendering method that represents multi-object dynamic scenes as scene graphs. We propose a learned scene graph representation that encodes object transformations and radiance, allowing us to efficiently render novel arrangements and views of the scene. To this end, we learn implicitly encoded scenes, combined with a jointly learned latent representation that describes similar objects with a single implicit function. We assess the proposed method on synthetic and real automotive data, validating that our approach learns dynamic scenes from nothing more than a video of the scene, and that it allows rendering novel photo-realistic views of novel scene compositions with unseen sets of objects at unseen poses.
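To make the idea of a scene graph that "encodes object transformations and radiance" concrete, the following is a minimal sketch under stated assumptions, not the architecture of the paper. The names `ObjectNode`, `SceneGraph`, `toy_background`, and `toy_car_model` are hypothetical, the pose is simplified to a translation plus a rotation about the vertical axis, and the toy functions stand in for learned implicit functions. It shows one shared per-class function being queried in each object's local frame, conditioned on a per-object latent code, with a separate background function for all other points.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Vec3 = Tuple[float, float, float]
Latent = Tuple[float, ...]
Sample = Tuple[float, Vec3]  # (density, rgb)


@dataclass
class ObjectNode:
    """One dynamic object: a pose, a box size, and a latent code (hypothetical layout)."""
    translation: Vec3   # box centre in world coordinates
    yaw: float          # rotation about the vertical (y) axis, in radians
    size: Vec3          # full box extents along the local x, y, z axes
    latent: Latent      # per-object code fed to the shared class model
    class_id: str       # selects which shared model describes this object

    def world_to_local(self, p: Vec3) -> Vec3:
        """Map a world point into the box frame, scaled so the box spans [-1, 1]."""
        dx = p[0] - self.translation[0]
        dy = p[1] - self.translation[1]
        dz = p[2] - self.translation[2]
        # Undo the object's rotation by rotating about y by -yaw.
        c, s = math.cos(-self.yaw), math.sin(-self.yaw)
        x = c * dx + s * dz
        z = -s * dx + c * dz
        return (2 * x / self.size[0], 2 * dy / self.size[1], 2 * z / self.size[2])


@dataclass
class SceneGraph:
    """Background function, shared per-class functions, and object nodes."""
    background: Callable[[Vec3], Sample]
    class_models: Dict[str, Callable[[Vec3, Latent], Sample]]
    objects: List[ObjectNode]

    def query(self, p: Vec3) -> Sample:
        """Return (density, rgb) at a world point from the first box containing it."""
        for node in self.objects:
            local = node.world_to_local(p)
            if all(abs(c) <= 1.0 for c in local):
                return self.class_models[node.class_id](local, node.latent)
        return self.background(p)


def toy_background(p: Vec3) -> Sample:
    # Stand-in for a learned static background function.
    return (0.1, (0.5, 0.5, 0.5))


def toy_car_model(local: Vec3, latent: Latent) -> Sample:
    # Stand-in for a learned class-level function; here the latent is just a colour.
    return (5.0, (latent[0], latent[1], latent[2]))


car = ObjectNode((10.0, 0.0, 0.0), 0.0, (2.0, 2.0, 4.0), (0.8, 0.1, 0.1), "car")
graph = SceneGraph(toy_background, {"car": toy_car_model}, [car])

inside = graph.query((10.5, 0.0, 1.0))   # falls inside the car's box
outside = graph.query((0.0, 0.0, 0.0))   # falls back to the background

# Rearranging the scene only edits the node's pose; the shared model is untouched.
car.translation = (0.0, 0.0, 20.0)
car.yaw = math.pi / 2
moved = graph.query((0.0, 0.0, 20.0))
```

Because the radiance function is evaluated in object-local coordinates, changing a node's pose or swapping its latent code yields a new scene composition without modifying the shared function, which is the property the abstract relies on for rendering unseen arrangements.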
