Neural Radiance Flow for 4D View Synthesis and Video Processing

We present a method, Neural Radiance Flow (NeRFlow),to learn a 4D spatial-temporal representation of a dynamic scene from a set of RGB images. Key to our approach is the use of a neural implicit representation that learns to capture the 3D occupancy, radiance, and dynamics of the scene. By enforcing consistency across different modalities, our representation enables multi-view rendering in diverse dynamic scenes, including water pouring, robotic interaction, and real images, outperforming state-of-the-art methods for spatial-temporal view synthesis. Our approach works even when inputs images are captured with only one camera. We further demonstrate that the learned representation can serve as an implicit scene prior, enabling video processing tasks such as image super-resolution and de-noising without any additional supervision.

[1]  Matthias Zwicker,et al.  SDFDiff: Differentiable Rendering of Signed Distance Fields for 3D Shape Optimization , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Jan-Michael Frahm,et al.  Deep blending for free-viewpoint image-based rendering , 2018, ACM Trans. Graph..

[3]  Gordon Wetzstein,et al.  DeepVoxels: Learning Persistent 3D Feature Embeddings , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Jan-Michael Frahm,et al.  Structure-from-Motion Revisited , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Minglun Gong,et al.  4D Reconstruction of Blooming Flowers , 2017, Comput. Graph. Forum.

[6]  Pat Hanrahan,et al.  A signal-processing framework for inverse rendering , 2001, SIGGRAPH.

[7]  Jan Kautz,et al.  Novel View Synthesis of Dynamic Scenes With Globally Coherent Depths From a Monocular Camera , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Silvio Savarese,et al.  Interactive Gibson Benchmark: A Benchmark for Interactive Navigation in Cluttered Environments , 2020, IEEE Robotics and Automation Letters.

[9]  Jean-Yves Guillemaut,et al.  General Dynamic Scene Reconstruction from Multiple View Video , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[10]  Jitendra Malik,et al.  View Synthesis by Appearance Flow , 2016, ECCV.

[11]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[12]  Xiaoyun Zhang,et al.  Depth-Aware Video Frame Interpolation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Shengping Zhang,et al.  Pix2Vox: Context-Aware 3D Reconstruction From Single and Multi-View Images , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[14]  Mathieu Aubry,et al.  A Papier-Mache Approach to Learning 3D Surface Generation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[15]  Alexandros G. Dimakis,et al.  Compressed Sensing with Deep Image Prior and Learned Regularization , 2018, ArXiv.

[16]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[17]  H. Yao,et al.  Pix2Vox: Context-Aware 3D Reconstruction From Single and Multi-View Images , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[18]  Dimitrios Tzionas,et al.  Embodied Hands: Modeling and Capturing Hands and Bodies Together , 2022, ArXiv.

[19]  Silvio Savarese,et al.  3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction , 2016, ECCV.

[20]  Alexei A. Efros,et al.  The Unreasonable Effectiveness of Deep Features as a Perceptual Metric , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Ning Zhang,et al.  Multi-view to Novel View: Synthesizing Novel Views With Self-learned Confidence , 2018, ECCV.

[22]  Gordon Wetzstein,et al.  Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations , 2019, NeurIPS.

[23]  Connor Schenck,et al.  Towards Learning to Perceive and Reason About Liquids , 2016, ISER.

[24]  Ting-Chun Wang,et al.  Learning-based view synthesis for light field cameras , 2016, ACM Trans. Graph..

[25]  Tatsuya Harada,et al.  Neural 3D Mesh Renderer , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Hao Li,et al.  PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[27]  Hao Li,et al.  Learning to Infer Implicit Surfaces without 3D Supervision , 2019, NeurIPS.

[28]  Ersin Yumer,et al.  Self-supervised Learning of Motion Capture , 2017, NIPS.

[29]  Andreas Geiger,et al.  Occupancy Flow: 4D Reconstruction by Learning Particle Dynamics , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[30]  Yaser Sheikh,et al.  4D Visualization of Dynamic Events From Unconstrained Multi-View Videos , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Jitendra Malik,et al.  Learning 3D Human Dynamics From Video , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Paul Debevec,et al.  DeepView: View Synthesis With Learned Gradient Descent , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Marcus A. Magnor,et al.  Virtual video camera: image-based viewpoint navigation through space and time , 2009, SIGGRAPH '09.

[34]  Theodore Lim,et al.  Generative and Discriminative Voxel Modeling with Convolutional Neural Networks , 2016, ArXiv.

[35]  Richard Szeliski,et al.  SynSin: End-to-End View Synthesis From a Single Image , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Hao Li,et al.  Soft Rasterizer: Differentiable Rendering for Unsupervised Single-View Mesh Reconstruction , 2019, ArXiv.

[37]  Jianxiong Xiao,et al.  3D ShapeNets: A deep representation for volumetric shapes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Andreas Geiger,et al.  Differentiable Volumetric Rendering: Learning Implicit 3D Representations Without 3D Supervision , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Changil Kim,et al.  Space-time Neural Irradiance Fields for Free-Viewpoint Video , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Christian Theobalt,et al.  Non-Rigid Neural Radiance Fields: Reconstruction and Novel View Synthesis of a Dynamic Scene From Monocular Video , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[41]  Joseph L. Mundy,et al.  Dynamic Probabilistic Volumetric Models , 2013, 2013 IEEE International Conference on Computer Vision.

[42]  Richard Szeliski,et al.  Consistent video depth estimation , 2020, ACM Trans. Graph..

[43]  Richard A. Newcombe,et al.  DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Sebastian Scherer,et al.  VoxNet: A 3D Convolutional Neural Network for real-time object recognition , 2015, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[45]  Yong-Liang Yang,et al.  HoloGAN: Unsupervised Learning of 3D Representations From Natural Images , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[46]  Michael Bosse,et al.  Unstructured lumigraph rendering , 2001, SIGGRAPH.

[47]  Sebastian Nowozin,et al.  Occupancy Networks: Learning 3D Reconstruction in Function Space , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Richard Szeliski,et al.  High-quality video view interpolation using a layered representation , 2004, SIGGRAPH 2004.

[49]  Thomas A. Funkhouser,et al.  Learning Shape Templates With Structured Implicit Functions , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[50]  Michael J. Black,et al.  OpenDR: An Approximate Differentiable Renderer , 2014, ECCV.

[51]  Edmond Boyer,et al.  Multi-view Dynamic Shape Refinement Using Local Temporal Integration , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[52]  Hao Zhang,et al.  Learning Implicit Fields for Generative Shape Modeling , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Hao Su,et al.  A Point Set Generation Network for 3D Object Reconstruction from a Single Image , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Gernot Riegler,et al.  OctNet: Learning Deep 3D Representations at High Resolutions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Yaser Sheikh,et al.  Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[56]  Karol Myszkowski,et al.  X-Fields , 2020, ACM Trans. Graph..

[57]  David Duvenaud,et al.  Neural Ordinary Differential Equations , 2018, NeurIPS.

[58]  BradleyDerek,et al.  Time-resolved 3d capture of non-stationary gas flows , 2008 .

[59]  Andrea Vedaldi,et al.  Deep Image Prior , 2017, International Journal of Computer Vision.

[60]  Jonathan T. Barron,et al.  Deformable Neural Radiance Fields , 2020, ArXiv.

[61]  Nassir Navab,et al.  Long Short-Term Memory Kalman Filters: Recurrent Neural Estimators for Pose Regularization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[62]  Jitendra Malik,et al.  Learning Category-Specific Mesh Reconstruction from Image Collections , 2018, ECCV.

[63]  Noah Snavely,et al.  Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  Andreas Geiger,et al.  Texture Fields: Learning Texture Representations in Function Space , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[65]  Gunnar Farnebäck,et al.  Two-Frame Motion Estimation Based on Polynomial Expansion , 2003, SCIA.

[66]  Pratul P. Srinivasan,et al.  NeRF , 2020, ECCV.

[67]  Francesc Moreno-Noguer,et al.  D-NeRF: Neural Radiance Fields for Dynamic Scenes , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[68]  Jiajun Wu,et al.  Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling , 2016, NIPS.

[69]  Daniel Cremers,et al.  Generalized Connectivity Constraints for Spatio-temporal 3D Reconstruction , 2014, ECCV.

[70]  Duygu Ceylan,et al.  DISN: Deep Implicit Surface Network for High-quality Single-view 3D Reconstruction , 2019, NeurIPS.

[71]  Qifeng Chen,et al.  Blind Video Temporal Consistency via Deep Video Prior , 2020, NeurIPS.

[72]  Jan Kautz,et al.  Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[73]  Wei Liu,et al.  Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images , 2018, ECCV.

[74]  Yinghao Huang,et al.  Towards Accurate Marker-Less Human Shape and Pose Estimation over Time , 2017, 2017 International Conference on 3D Vision (3DV).

[75]  Wojciech Matusik,et al.  Moving gradients: a path-based method for plausible image interpolation , 2009, ACM Trans. Graph..

[76]  Leonidas J. Guibas,et al.  PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space , 2017, NIPS.

[77]  Jean-Yves Guillemaut,et al.  Temporally Coherent 4D Reconstruction of Complex Dynamic Scenes , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[78]  Derek Bradley,et al.  Detailed spatio-temporal reconstruction of eyelids , 2015, ACM Trans. Graph..

[79]  Paul E. Debevec,et al.  Acquisition of time-varying participating media , 2005, ACM Trans. Graph..

[80]  Ulugbek Kamilov,et al.  Image Restoration Using Total Variation Regularized Deep Image Prior , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[81]  Jean-Michel Morel,et al.  A non-local algorithm for image denoising , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[82]  Eero P. Simoncelli,et al.  Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.