Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models

We present Text2Room, a method for generating room-scale textured 3D meshes from a text prompt. To this end, we leverage pre-trained 2D text-to-image models to synthesize a sequence of images from different poses. To lift these outputs into a consistent 3D scene representation, we combine monocular depth estimation with a text-conditioned inpainting model. The core idea of our approach is a tailored viewpoint selection such that the content of each image can be fused into a seamless, textured 3D mesh. More specifically, we propose a continuous alignment strategy that iteratively fuses scene frames with the existing geometry. Unlike existing works that focus on generating single objects or zoom-out trajectories from text, our method generates complete 3D scenes with multiple objects and explicit 3D geometry. We evaluate our approach using qualitative and quantitative metrics and show that it is the first method to generate room-scale 3D geometry with compelling textures from text alone.
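As a rough illustration of what lifting an image into 3D with an estimated depth map can involve, the sketch below back-projects per-pixel depth through a pinhole camera and connects neighbouring pixels into triangles. It is not the authors' implementation: the pinhole model, the function names `backproject_depth` and `grid_faces`, and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions made for this example, and the depth alignment, inpainting, and fusion with existing geometry described above are left out.

```python
def backproject_depth(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows) to camera-space 3D points.

    Assumes a pinhole camera: pixel (u, v) with depth z maps to
    ((u - cx) / fx * z, (v - cy) / fy * z, z). Points are returned
    in row-major order, one per pixel.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            points.append(((u - cx) / fx * z, (v - cy) / fy * z, z))
    return points


def grid_faces(height, width):
    """Connect each 2x2 pixel block into two triangles.

    Vertex indices refer to the row-major point list produced by
    backproject_depth for an image of the same size.
    """
    faces = []
    for v in range(height - 1):
        for u in range(width - 1):
            tl = v * width + u   # top-left vertex of the block
            tr = tl + 1          # top-right
            bl = tl + width      # bottom-left
            br = bl + 1          # bottom-right
            faces.append((tl, bl, tr))
            faces.append((tr, bl, br))
    return faces


# Hypothetical 2x3 depth map and intrinsics, for illustration only.
depth = [[1.0, 1.0, 1.0],
         [2.0, 2.0, 2.0]]
points = backproject_depth(depth, fx=1.0, fy=1.0, cx=1.0, cy=0.5)
faces = grid_faces(height=2, width=3)
```

In a full pipeline, a step like this would be followed by filtering stretched triangles at depth discontinuities and merging the new faces with the mesh built from earlier views; those parts are outside the scope of this sketch.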
