Deferred Neural Rendering

The modern computer graphics pipeline can synthesize images at remarkable visual quality; however, it requires well-defined, high-quality 3D content as input. In this work, we explore the use of imperfect 3D content, for instance, obtained from photometric reconstructions with noisy and incomplete surface geometry, while still aiming to produce photo-realistic (re-)renderings. To address this challenging problem, we introduce Deferred Neural Rendering, a new paradigm for image synthesis that combines the traditional graphics pipeline with learnable components. Specifically, we propose Neural Textures, which are learned feature maps that are trained as part of the scene capture process. Similar to traditional textures, neural textures are stored as maps on top of 3D mesh proxies; however, the high-dimensional feature maps contain significantly more information, which can be interpreted by our new deferred neural rendering pipeline. Both the neural textures and the deferred neural renderer are trained end-to-end, enabling us to synthesize photo-realistic images even when the original 3D content was imperfect. In contrast to traditional, black-box 2D generative neural networks, our 3D representation gives us explicit control over the generated output and allows for a wide range of application domains. For instance, we can synthesize temporally consistent video re-renderings of recorded 3D scenes because our representation is inherently embedded in 3D space. This way, neural textures can be utilized to coherently re-render or manipulate existing video content in both static and dynamic environments at real-time rates. We show the effectiveness of our approach in several experiments on novel view synthesis, scene editing, and facial reenactment, and compare to state-of-the-art approaches that leverage the standard graphics pipeline as well as conventional generative neural networks.
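
To make the pipeline concrete, the following PyTorch sketch illustrates the core idea: a learnable feature map (the neural texture) is sampled in screen space via UV coordinates rasterized from the mesh proxy, and a small convolutional network (standing in for the paper's U-Net-style deferred neural renderer) translates the sampled features into an RGB image; both are optimized end-to-end against captured photographs. This is a minimal sketch, not the authors' implementation: the names `NeuralTexture`, `NeuralRenderer`, and `train_step` are illustrative, the texture is single-resolution rather than hierarchical, and the per-view UV rasterization is assumed to be precomputed by a standard rasterizer.

```python
# Minimal sketch of the deferred-neural-rendering idea (illustrative, not the
# authors' code). Assumes uv_map tensors have been rasterized from the mesh
# proxy ahead of time; the renderer is a small conv stack instead of a U-Net.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralTexture(nn.Module):
    """Learned feature map stored on the mesh's UV atlas (like a texture, but C channels)."""
    def __init__(self, channels=16, resolution=512):
        super().__init__()
        self.features = nn.Parameter(torch.randn(1, channels, resolution, resolution) * 0.01)

    def forward(self, uv):                # uv: (B, H, W, 2) in [0, 1], from rasterization
        grid = uv * 2.0 - 1.0             # grid_sample expects coordinates in [-1, 1]
        feats = self.features.expand(uv.shape[0], -1, -1, -1)
        return F.grid_sample(feats, grid, align_corners=False)  # (B, C, H, W) screen-space features

class NeuralRenderer(nn.Module):
    """Interprets the sampled neural-texture features and outputs an RGB image."""
    def __init__(self, channels=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, sampled_features):
        return self.net(sampled_features)

# Texture and renderer are trained jointly on (uv_map, photo) pairs of the captured scene.
texture, renderer = NeuralTexture(), NeuralRenderer()
optimizer = torch.optim.Adam(list(texture.parameters()) + list(renderer.parameters()), lr=1e-3)

def train_step(uv_map, photo):            # uv_map: (B, H, W, 2), photo: (B, 3, H, W)
    optimizer.zero_grad()
    rendered = renderer(texture(uv_map))  # rasterize -> sample neural texture -> neural shading
    loss = F.l1_loss(rendered, photo)     # photometric reconstruction loss
    loss.backward()                       # gradients flow into both the renderer and the texture
    optimizer.step()
    return loss.item()
```

Because the texture lookup (`grid_sample`) is differentiable, the photometric loss back-propagates through the renderer into the texture itself, which is what allows the learned features to compensate for noisy or incomplete proxy geometry.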
