DreamFusion: Text-to-3D using 2D Diffusion

Recent breakthroughs in text-to-image synthesis have been driven by diffusion models trained on billions of image-text pairs. Adapting this approach to 3D synthesis would require large-scale datasets of labeled 3D data and efficient architectures for denoising 3D data, neither of which currently exists. In this work, we circumvent these limitations by using a pretrained 2D text-to-image diffusion model to perform text-to-3D synthesis. We introduce a loss based on probability density distillation that enables the use of a 2D diffusion model as a prior for optimization of a parametric image generator. Using this loss in a DeepDream-like procedure, we optimize a randomly initialized 3D model (a Neural Radiance Field, or NeRF) via gradient descent such that its 2D renderings from random angles achieve a low loss. The resulting 3D model of the given text can be viewed from any angle, relit by arbitrary illumination, or composited into any 3D environment. Our approach requires no 3D training data and no modifications to the image diffusion model, demonstrating the effectiveness of pretrained image diffusion models as priors. See dreamfusion3d.github.io for a more immersive view into our 3D results.
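The optimization procedure described above can be illustrated with a minimal sketch. Everything below is a toy stand-in, not the paper's implementation: `render` collapses the NeRF renderer to an identity map over parameters, and `denoiser_eps` replaces the pretrained text-to-image diffusion model with a hand-built noise predictor that pulls noisy images toward a fixed target. What the sketch does preserve is the shape of the distillation gradient, w(t) · (ε̂(x_t; t) − ε) · ∂x/∂θ, evaluated at a freshly noised rendering each step, with the denoiser's Jacobian omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def render(theta):
    # Stand-in for differentiable rendering of the 3D model from a random
    # camera; here the "image" is just the parameter vector itself.
    return theta

def denoiser_eps(x_t, t):
    # Hypothetical stand-in for the pretrained diffusion model's noise
    # prediction eps_hat(x_t; y, t): it treats an all-ones image as the
    # "prompt-consistent" mode of the data distribution.
    target = np.ones_like(x_t)
    alpha = np.cos(0.5 * np.pi * t)
    sigma = np.sin(0.5 * np.pi * t)
    return (x_t - alpha * target) / sigma

def sds_grad(theta):
    # One stochastic estimate of the distillation gradient:
    #   w(t) * (eps_hat(x_t; t) - eps) * dx/dtheta,
    # with x = render(theta) and x_t = alpha_t * x + sigma_t * eps.
    # The denoiser Jacobian is dropped (treated as identity).
    x = render(theta)
    t = rng.uniform(0.02, 0.98)          # avoid degenerate noise levels
    alpha = np.cos(0.5 * np.pi * t)
    sigma = np.sin(0.5 * np.pi * t)
    eps = rng.standard_normal(x.shape)
    x_t = alpha * x + sigma * eps
    w = sigma ** 2                       # one possible weighting choice
    return w * (denoiser_eps(x_t, t) - eps)  # dx/dtheta is identity here

theta = rng.standard_normal(4)           # randomly initialized "3D model"
for _ in range(2000):
    theta -= 0.05 * sds_grad(theta)      # plain gradient descent
```

Under these toy assumptions, `theta` drifts toward the denoiser's target, mirroring how renderings of the NeRF are pushed toward images the diffusion prior considers likely for the given text.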
