Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes

We present a method to perform novel view and time synthesis of dynamic scenes, requiring only a monocular video with known camera poses as input. To do this, we introduce Neural Scene Flow Fields, a new representation that models the dynamic scene as a time-variant continuous function of appearance, geometry, and 3D scene motion. Our representation is optimized through a neural network to fit the observed input views. We show that our representation can be used for complex dynamic scenes, including thin structures, view-dependent effects, and natural degrees of motion. We conduct a number of experiments that demonstrate our approach significantly outperforms recent monocular view synthesis methods, and show qualitative results of space-time view synthesis on a variety of real-world videos.
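To make the idea of a time-variant continuous scene function concrete, below is a minimal, illustrative PyTorch sketch of how such a field could be parameterized: a coordinate MLP that maps a positionally encoded space-time sample (x, y, z, t) and a viewing direction to color, volume density, and forward/backward 3D scene flow. All names, layer widths, and encoding settings here (e.g. `NeuralSceneFlowField`, `positional_encoding`, the head sizes) are assumptions for illustration only, not the paper's exact architecture or training setup.

```python
# Illustrative sketch (assumed architecture, not the paper's exact model):
# an MLP mapping encoded (x, y, z, t) and a view direction to
# color, volume density, and forward/backward 3D scene flow.
import torch
import torch.nn as nn


def positional_encoding(x: torch.Tensor, num_freqs: int) -> torch.Tensor:
    """Map each input coordinate to [sin(2^k x), cos(2^k x)] features."""
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device)
    angles = x[..., None] * freqs                 # (..., dim, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(start_dim=-2)              # (..., dim * 2 * num_freqs)


class NeuralSceneFlowField(nn.Module):
    """Hypothetical time-variant field: (x, t, d) -> (rgb, sigma, scene flow)."""

    def __init__(self, xyz_freqs: int = 10, dir_freqs: int = 4, width: int = 256):
        super().__init__()
        in_xyzt = 4 * 2 * xyz_freqs               # encoded (x, y, z, t)
        in_dir = 3 * 2 * dir_freqs                # encoded viewing direction
        self.trunk = nn.Sequential(
            nn.Linear(in_xyzt, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(width, 1)     # volume density
        self.flow_head = nn.Linear(width, 6)      # 3D offsets toward t+1 and t-1
        self.rgb_head = nn.Sequential(            # view-dependent color
            nn.Linear(width + in_dir, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid(),
        )
        self.xyz_freqs, self.dir_freqs = xyz_freqs, dir_freqs

    def forward(self, xyzt: torch.Tensor, viewdir: torch.Tensor):
        h = self.trunk(positional_encoding(xyzt, self.xyz_freqs))
        sigma = torch.relu(self.sigma_head(h))                 # (N, 1)
        flow = self.flow_head(h)                               # (N, 6)
        d_enc = positional_encoding(viewdir, self.dir_freqs)
        rgb = self.rgb_head(torch.cat([h, d_enc], dim=-1))     # (N, 3)
        return rgb, sigma, flow


if __name__ == "__main__":
    model = NeuralSceneFlowField()
    xyzt = torch.rand(1024, 4)    # (x, y, z, t) samples along camera rays
    dirs = torch.rand(1024, 3)    # corresponding viewing directions
    rgb, sigma, flow = model(xyzt, dirs)
    print(rgb.shape, sigma.shape, flow.shape)  # (1024, 3) (1024, 1) (1024, 6)
```

In a full system, the predicted color and density would be composited along camera rays with volume rendering and fit to the input frames, while the predicted scene flow links samples across neighboring time steps; the simplifications above (and any omitted components) are deliberate, since this is only a sketch of the representation, not a reimplementation of the method.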
