HOLODIFFUSION: Training a 3D Diffusion Model Using 2D Images

Diffusion models have emerged as the leading approach for generative modeling of 2D images. Their success stems in part from the ability to train them on millions, if not billions, of images with a stable learning objective. However, extending these models to 3D remains difficult for two reasons. First, acquiring large quantities of 3D training data is far harder than acquiring 2D images. Second, while it is conceptually trivial to extend the models to operate on 3D rather than 2D grids, the associated cubic growth in memory and compute complexity makes this infeasible. We address the first challenge by introducing a new diffusion setup that can be trained, end-to-end, with only posed 2D images for supervision, and the second by proposing an image formation model that decouples model memory from spatial memory. We evaluate our method on real-world data, using the CO3D dataset, which has not previously been used to train 3D generative models. We show that our diffusion models are scalable, train robustly, and are competitive with existing approaches to 3D generative modeling in sample quality and fidelity.
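To make the "stable learning objective" concrete, the sketch below shows the standard denoising-diffusion (DDPM, Ho et al. 2020) training loss that 2D image models of this kind optimize: noise a clean sample at a random timestep, then regress the injected noise. This is an illustrative reconstruction, not the paper's code; the linear noise schedule, shapes, and the `model(x_t, t)` signature are assumptions.

```python
import numpy as np

def ddpm_loss(model, x0, rng, T=1000):
    """One denoising-diffusion training loss evaluation on a batch x0.

    model: callable (x_t, t) -> predicted noise, same shape as x_t (assumed API).
    x0:    clean samples, shape (batch, ...).
    """
    betas = np.linspace(1e-4, 0.02, T)                 # linear noise schedule (assumed)
    alpha_bars = np.cumprod(1.0 - betas)               # cumulative signal fraction
    t = rng.integers(0, T, size=x0.shape[0])           # random timestep per sample
    ab = alpha_bars[t].reshape(-1, *([1] * (x0.ndim - 1)))
    eps = rng.standard_normal(x0.shape)                # target noise
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps   # forward (noising) process
    return np.mean((model(x_t, t) - eps) ** 2)         # predict the noise, MSE
```

Because the regression target is plain Gaussian noise, this objective needs no adversarial balancing, which is what makes it stable enough to scale to very large image collections; HoloDiffusion's contribution is supervising such a setup through a differentiable image formation model using only posed 2D views.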
