Sequential View Synthesis with Transformer

This paper addresses the problem of novel view synthesis by means of neural rendering, where we are interested in predicting the novel view at an arbitrary camera pose based on a given set of input images from other viewpoints. Using the known query pose and input poses, we create an ordered set of observations that leads to the target view, thereby reformulating single novel view synthesis as a sequential view prediction task. The proposed Transformer-based Generative Query Network (T-GQN) extends neural-rendering methods with two new concepts. First, we use multi-view attention learning between the context images to obtain multiple implicit scene representations. Second, we introduce a sequential rendering decoder that predicts an image sequence, including the target view, from the learned representations. Finally, we evaluate our model on several challenging datasets and demonstrate that it not only produces consistent predictions but also requires no retraining for fine-tuning.
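To make the two concepts concrete, the sketch below shows one plausible way to wire them together in PyTorch. All module names, layer sizes, the pose dimensionality, and the GRU-based decoder are assumptions made here for illustration, not the authors' published architecture: each context (image, pose) pair is encoded into a per-view code, a Transformer encoder performs multi-view attention across those codes, and a recurrent decoder renders the view sequence whose final frame is the target view.

```python
# Minimal illustrative sketch of T-GQN's two ideas (multi-view attention and
# sequential rendering). All names and sizes are assumptions for illustration.
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Encodes one (image, camera pose) pair into a flat per-view code."""
    def __init__(self, pose_dim=7, d_model=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64 + pose_dim, d_model)

    def forward(self, image, pose):
        feat = self.cnn(image).flatten(1)                   # (B, 64)
        return self.proj(torch.cat([feat, pose], dim=-1))   # (B, d_model)

class TGQNSketch(nn.Module):
    """Multi-view attention over context codes plus a sequential decoder."""
    def __init__(self, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        self.encoder = ContextEncoder(d_model=d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.attend = nn.TransformerEncoder(layer, num_layers)  # multi-view attention
        self.decode_step = nn.GRUCell(d_model, d_model)          # sequential rendering state
        self.to_image = nn.Sequential(                           # render one 64x64 view
            nn.Linear(d_model, 64 * 64 * 3), nn.Sigmoid(),
        )

    def forward(self, images, poses):
        # images: (B, N, 3, 64, 64); poses: (B, N, pose_dim), ordered so that
        # the last pose is the query pose of the target view.
        B, N = images.shape[:2]
        codes = torch.stack(
            [self.encoder(images[:, i], poses[:, i]) for i in range(N)], dim=1
        )                                      # (B, N, d_model)
        scene = self.attend(codes)             # attended per-view scene representations
        h = scene.new_zeros(B, scene.size(-1))
        outputs = []
        for i in range(N):                     # predict the view sequence step by step
            h = self.decode_step(scene[:, i], h)
            outputs.append(self.to_image(h).view(B, 3, 64, 64))
        return torch.stack(outputs, dim=1)     # (B, N, 3, 64, 64)

# Usage: the final frame of the predicted sequence is the novel target view.
model = TGQNSketch()
imgs = torch.rand(2, 5, 3, 64, 64)
poses = torch.rand(2, 5, 7)
pred_seq = model(imgs, poses)
target_view = pred_seq[:, -1]
```

Reading off only the final frame recovers the single novel view, while the intermediate frames correspond to the ordered observations leading up to the query pose, which is what makes the per-step predictions mutually consistent.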
