NVS Machines: Learning Novel View Synthesis with Fine-grained View Control

We present an approach that learns to synthesize high-quality, novel views of 3D objects or scenes, while providing fine-grained and precise control over the 6-DOF viewpoint. The approach is self-supervised and only requires 2D images and associated view transforms for training. Our main contribution is a network architecture that leverages a transforming auto-encoder in combination with a depth-guided warping procedure to predict geometrically accurate unseen views. Leveraging geometric constraints renders direct supervision via depth or flow maps unnecessary. If large parts of the object are occluded in the source view, a purely learning based prior is used to predict the values for dis-occluded pixels. Our network furthermore predicts a per-pixel mask, used to fuse depth-guided and pixel-based predictions. The resulting images reflect the desired 6-DOF transformation and details are preserved. We thoroughly evaluate our architecture on synthetic and real scenes and under fine-grained and fixed-view settings. Finally, we demonstrate that the approach generalizes to entirely unseen images such as product images downloaded from the internet.

[1]  Jitendra Malik,et al.  View Synthesis by Appearance Flow , 2016, ECCV.

[2]  Li Zhang,et al.  Soft 3D reconstruction for view synthesis , 2017, ACM Trans. Graph..

[3]  Ning Zhang,et al.  Multi-view to Novel View: Synthesizing Novel Views With Self-learned Confidence , 2018, ECCV.

[4]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[5]  Leonard McMillan,et al.  Plenoptic Modeling: An Image-Based Rendering System , 2023 .

[6]  Jiajun Wu,et al.  Visual Object Networks: Image Generation with Disentangled 3D Representations , 2018, NeurIPS.

[7]  Jitendra Malik,et al.  Learning a Multi-View Stereo Machine , 2017, NIPS.

[8]  Richard Szeliski,et al.  A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[9]  Ran He,et al.  Beyond Face Rotation: Global and Local Perception GAN for Photorealistic and Identity Preserving Frontal View Synthesis , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[10]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.

[11]  Bo Zhao,et al.  Multi-View Image Generation from a Single-View , 2017, ACM Multimedia.

[12]  Eero P. Simoncelli,et al.  Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.

[13]  Jonathan T. Barron,et al.  Pushing the Boundaries of View Extrapolation With Multiplane Images , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Jeff Donahue,et al.  Large Scale GAN Training for High Fidelity Natural Image Synthesis , 2018, ICLR.

[15]  John Flynn,et al.  Deep Stereo: Learning to Predict New Views from the World's Imagery , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Gernot Riegler,et al.  OctNet: Learning Deep 3D Representations at High Resolutions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Richard Szeliski,et al.  Layered depth images , 1998, SIGGRAPH.

[18]  Paul Debevec,et al.  Modeling and Rendering Architecture from Photographs , 1996, SIGGRAPH 1996.

[19]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Marc Pollefeys,et al.  Pulling Things out of Perspective , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Michael Goesele,et al.  Image-based rendering in the gradient domain , 2013, ACM Trans. Graph..

[23]  Thomas Brox,et al.  Single-view to Multi-view: Reconstructing Unseen Views with a Convolutional Network , 2015, ArXiv.

[24]  Christopher Burgess,et al.  beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework , 2016, ICLR 2016.

[25]  Paul Debevec,et al.  DeepView: View Synthesis With Learned Gradient Descent , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Joshua B. Tenenbaum,et al.  Deep Convolutional Inverse Graphics Network , 2015, NIPS.

[27]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[28]  Geoffrey E. Hinton,et al.  Matrix capsules with EM routing , 2018, ICLR.

[29]  Chen Kong,et al.  Learning Efficient Point Cloud Generation for Dense 3D Object Reconstruction , 2017, AAAI.

[30]  Hao Li,et al.  Soft Rasterizer: Differentiable Rendering for Unsupervised Single-View Mesh Reconstruction , 2019, ArXiv.

[31]  Silvio Savarese,et al.  3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction , 2016, ECCV.

[32]  Geoffrey E. Hinton,et al.  Transforming Auto-Encoders , 2011, ICANN.

[33]  Yong-Liang Yang,et al.  HoloGAN: Unsupervised Learning of 3D Representations From Natural Images , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[34]  Daniel Cremers,et al.  Direct Sparse Odometry , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Timo Aila,et al.  A Style-Based Generator Architecture for Generative Adversarial Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Andrew W. Fitzgibbon,et al.  Image-Based Rendering Using Image-Based Priors , 2005, International Journal of Computer Vision.

[37]  Pascal Fua,et al.  Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation , 2018, ECCV.

[38]  Noah Snavely,et al.  Layer-structured 3D Scene Inference via View Synthesis , 2018, ECCV.

[39]  Hao Su,et al.  A Point Set Generation Network for 3D Object Reconstruction from a Single Image , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Richard Szeliski,et al.  The lumigraph , 1996, SIGGRAPH.

[41]  Thomas Brox,et al.  Multi-view 3D Models from Single Images with a Convolutional Network , 2015, ECCV.

[42]  Paul A. Beardsley,et al.  Image-based 3D photography using opacity hulls , 2002, ACM Trans. Graph..

[43]  Gabriel J. Brostow,et al.  Interpretable Transformations with Encoder-Decoder Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[44]  Abhinav Gupta,et al.  Learning a Predictable and Generative Vector Representation for Objects , 2016, ECCV.

[45]  Fisher Yu,et al.  3D Reconstruction from Accidental Motion , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Thomas Brox,et al.  Learning to Generate Chairs, Tables and Cars with Convolutional Networks , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[47]  Ruigang Yang,et al.  View Extrapolation of Human Body from a Single Image , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[48]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[49]  Richard Szeliski,et al.  High-quality video view interpolation using a layered representation , 2004, SIGGRAPH 2004.

[50]  Marc Levoy,et al.  Light field rendering , 1996, SIGGRAPH.

[51]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[52]  Karthik Ramani,et al.  Deep Learning 3D Shape Surfaces Using Geometry Images , 2016, ECCV.

[53]  Xuming He,et al.  Geometry-Aware Deep Network for Single-Image Novel View Synthesis , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[54]  Sergey Tulyakov,et al.  Transformable Bottleneck Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[55]  Zhen Wang,et al.  Multi-class Generative Adversarial Networks with the L2 Loss Function , 2016, ArXiv.

[56]  Gordon Wetzstein,et al.  DeepVoxels: Learning Persistent 3D Feature Embeddings , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Ersin Yumer,et al.  Transformation-Grounded Image Generation Network for Novel 3D View Synthesis , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[59]  Geoffrey E. Hinton,et al.  Dynamic Routing Between Capsules , 2017, NIPS.

[60]  Scott E. Reed,et al.  Weakly-supervised Disentangling with Recurrent Transformations for 3D View Synthesis , 2015, NIPS.

[61]  Alex Kendall,et al.  End-to-End Learning of Geometry and Context for Deep Stereo Regression , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[62]  Li Fei-Fei,et al.  Perceptual Losses for Real-Time Style Transfer and Super-Resolution , 2016, ECCV.

[63]  John Flynn,et al.  Stereo magnification , 2018, ACM Trans. Graph..