Paper information - DreamSparse: Escaping from Plato's Cave with 2D Frozen Diffusion Model Given Sparse Views

Synthesizing novel view images from a few views is a challenging but practical problem. In such few-view settings, existing methods often struggle to produce high-quality results or require per-object optimization, because the input views provide insufficient information. In this work, we explore leveraging the strong 2D priors in pre-trained diffusion models for synthesizing novel view images. 2D diffusion models, however, lack 3D awareness, which leads to distorted image synthesis and compromises object identity. To address these problems, we propose DreamSparse, a framework that enables a frozen pre-trained diffusion model to generate geometry- and identity-consistent novel view images. Specifically, DreamSparse incorporates a geometry module designed to capture 3D features from sparse views as a 3D prior. A spatial guidance model is then introduced to convert these 3D feature maps into spatial information for the generative process. This information is used to guide the pre-trained diffusion model, enabling it to generate geometrically consistent images without tuning it. Leveraging the strong image priors in pre-trained diffusion models, DreamSparse is capable of synthesizing high-quality novel views for both object-level and scene-level images and of generalizing to open-set images. Experimental results demonstrate that our framework can effectively synthesize novel view images from sparse views and outperforms baselines on both trained and open-set category images. More results can be found on our project page: https://sites.google.com/view/dreamsparse-webpage.

S. Gu | Paul Yoo | Jiaxian Guo | Yutaka Matsuo
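
The data flow described in the abstract (sparse views, then a geometry module, then a spatial guidance model, then a frozen diffusion model) can be illustrated with a toy sketch. This is not the authors' implementation: every function name, the averaging "geometry module", the single-scalar "guidance model", and the two fixed "diffusion" weights are hypothetical stand-ins chosen only to show the training pattern, in which the guidance parameters receive updates while the backbone stays frozen.

```python
# Toy sketch only. All names and numbers are hypothetical illustrations,
# not taken from the DreamSparse paper or code.
from typing import List, Tuple

Feature = List[float]

# Stand-in for the pre-trained diffusion model's weights. A tuple is used
# so that the "frozen" weights cannot be modified in place.
FROZEN_WEIGHTS: Tuple[float, float] = (0.5, -0.25)


def geometry_module(view_features: List[Feature]) -> Feature:
    """Aggregate per-view features into one feature (here: a plain average)."""
    n = len(view_features)
    return [sum(column) / n for column in zip(*view_features)]


def spatial_guidance(feature: Feature, weight: float) -> Feature:
    """Map the aggregated feature to a guidance signal (here: one trainable scale)."""
    return [weight * f for f in feature]


def frozen_denoiser(noisy: Feature, guidance: Feature) -> Feature:
    """Frozen backbone: the guidance is added to its hidden activation."""
    w_in, w_out = FROZEN_WEIGHTS
    hidden = [w_in * x + g for x, g in zip(noisy, guidance)]
    return [w_out * h for h in hidden]


def loss(noisy: Feature, target: Feature, views: List[Feature], weight: float) -> float:
    """Squared error between the guided prediction and the target."""
    feature = geometry_module(views)
    pred = frozen_denoiser(noisy, spatial_guidance(feature, weight))
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def train_step(noisy: Feature, target: Feature, views: List[Feature],
               weight: float, lr: float = 0.1) -> float:
    """One gradient-descent step on the guidance weight only."""
    feature = geometry_module(views)
    pred = frozen_denoiser(noisy, spatial_guidance(feature, weight))
    w_out = FROZEN_WEIGHTS[1]
    # d(pred_i)/d(weight) = w_out * feature_i, so by the chain rule:
    grad = sum(2.0 * (p - t) * w_out * f for p, t, f in zip(pred, target, feature))
    return weight - lr * grad


if __name__ == "__main__":
    views = [[1.0, 2.0], [3.0, 4.0]]   # two hypothetical context-view features
    noisy = [0.0, 0.0]
    target = [-0.5, -0.75]
    weight = 0.0
    for _ in range(200):
        weight = train_step(noisy, target, views, weight)
    print("guidance weight:", weight, "loss:", loss(noisy, target, views, weight))
```

Only `weight` changes during training, while `FROZEN_WEIGHTS` is never written to. That mirrors, in miniature, the abstract's claim that the pre-trained model is guided "without tuning it".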
