DreamHuman: Animatable 3D Avatars from Text

We present DreamHuman, a method for generating realistic, animatable 3D human avatar models solely from textual descriptions. Recent text-to-3D methods have made considerable strides, but important limitations remain: control and spatial resolution are often limited, existing methods produce static rather than animatable 3D human models, and anthropometric consistency for complex articulated structures like people remains a challenge. DreamHuman connects large text-to-image synthesis models, neural radiance fields, and statistical human body models in a novel modeling and optimization framework. This makes it possible to generate dynamic 3D human avatars with high-quality textures and learned, instance-specific surface deformations. We demonstrate that our method can generate a wide variety of animatable, realistic 3D human models from text. Our 3D models exhibit diverse appearance, clothing, skin tones, and body shapes, and significantly outperform both generic text-to-3D approaches and previous text-based 3D avatar generators in visual fidelity. For more results and animations, please visit our website at https://dream-human.github.io.
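The abstract describes connecting a frozen text-to-image model to a NeRF-based avatar through an optimization loop. The sketch below is a toy illustration of the score-distillation style of optimization popularized by DreamFusion, which this line of work builds on; it is not the DreamHuman method itself. Every component is a labeled stand-in: `render` plays the role of the NeRF renderer (which in DreamHuman would be conditioned on a posed statistical body model), and `denoiser_residual` stands in for the frozen diffusion model's noise prediction for the text prompt.

```python
import numpy as np

# Toy sketch of score-distillation-style optimization (as in DreamFusion).
# All functions are hypothetical stand-ins, not the actual DreamHuman code:
# - `render` mimics a differentiable NeRF/avatar renderer,
# - `denoiser_residual` mimics eps_phi(x_t; text) - eps from a frozen
#   text-to-image diffusion model (here it simply pulls pixels toward 1.0).

rng = np.random.default_rng(0)
theta = rng.normal(size=8) * 0.1             # avatar/NeRF parameters (toy)
W = np.hstack([np.eye(4), 0.5 * np.eye(4)])  # toy linear "renderer" weights

def render(params):
    """Toy differentiable renderer: parameters -> 4-pixel image."""
    return W @ params

def denoiser_residual(noisy_img):
    """Stand-in for the diffusion model's score; drives pixels toward 1.0."""
    return 0.5 * (noisy_img - np.ones_like(noisy_img))

lr = 0.05
for _ in range(200):
    img = render(theta)
    noisy = img + 0.1 * rng.normal(size=img.shape)  # forward diffusion, fixed t
    grad_img = denoiser_residual(noisy)   # SDS skips the denoiser's Jacobian
    theta -= lr * (W.T @ grad_img)        # backprop through the renderer only

print(np.round(render(theta), 2))
```

The key design point this illustrates is that the diffusion model is never updated: its residual is injected as a gradient on the rendered image and chained only through the renderer, so the 3D representation absorbs all the learning.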
