AvatarVerse: High-quality & Stable 3D Avatar Creation from Text and Pose

Creating expressive, diverse, and high-quality 3D avatars from highly customized text descriptions and pose guidance is a challenging task, owing to the intricacy of 3D modeling and texturing required to preserve fine details across diverse styles (realistic, fictional, etc.). We present AvatarVerse, a stable pipeline for generating expressive, high-quality 3D avatars from nothing but text descriptions and pose guidance. Specifically, we introduce a 2D diffusion model conditioned on DensePose signals to establish 3D pose control of avatars through 2D images, which enhances view consistency even under partial observation. This addresses the infamous Janus problem and significantly stabilizes the generation process. Moreover, we propose a progressive high-resolution 3D synthesis strategy that substantially improves the quality of the created avatars. As a result, the proposed AvatarVerse pipeline achieves zero-shot creation of 3D avatars that are not only more expressive, but also of higher quality and fidelity than previous work. Rigorous qualitative evaluations and user studies demonstrate AvatarVerse's superiority in synthesizing high-fidelity 3D avatars, setting a new standard for high-quality and stable 3D avatar creation. Our project page is: https://avatarverse3d.github.io
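The abstract describes pose control via a DensePose-conditioned 2D diffusion model that guides 3D optimization. The sketch below is a minimal, illustrative example (not the authors' implementation) of how such a ControlNet could drive Score Distillation Sampling on rendered views, assuming a Stable Diffusion 1.5 backbone through Hugging Face `diffusers`; the DensePose ControlNet checkpoint path and the `sds_loss` helper are hypothetical.

```python
# Minimal sketch: DensePose-conditioned Score Distillation Sampling (SDS).
# Assumes SD 1.5 via diffusers and a hypothetical DensePose ControlNet checkpoint.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, ControlNetModel, DDPMScheduler

device = "cuda"
base = "runwayml/stable-diffusion-v1-5"
vae = AutoencoderKL.from_pretrained(base, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet").to(device)
controlnet = ControlNetModel.from_pretrained("path/to/densepose-controlnet").to(device)  # hypothetical checkpoint
scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")
alphas_cumprod = scheduler.alphas_cumprod.to(device)


def sds_loss(rendered_rgb, densepose_map, prompt_embeds, guidance_scale=50.0):
    """SDS loss for one rendered view, conditioned on its DensePose rendering.

    rendered_rgb:  (B, 3, 512, 512) differentiable rendering of the 3D avatar in [0, 1]
    densepose_map: (B, 3, 512, 512) DensePose IUV image rendered from the posed body mesh
    prompt_embeds: (2B, 77, 768) stacked [unconditional, conditional] text embeddings
    """
    # Encode the rendering into SD latent space (gradients flow back to the 3D representation).
    latents = vae.encode(rendered_rgb * 2.0 - 1.0).latent_dist.sample() * vae.config.scaling_factor

    # Sample a diffusion timestep and perturb the latents.
    t = torch.randint(20, 980, (latents.shape[0],), device=device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)

    with torch.no_grad():
        # Duplicate inputs for classifier-free guidance.
        latent_in = torch.cat([noisy] * 2)
        t_in = torch.cat([t] * 2)
        cond_in = torch.cat([densepose_map] * 2)
        # DensePose-conditioned ControlNet residuals steer the frozen UNet toward the target pose.
        down_res, mid_res = controlnet(
            latent_in, t_in, encoder_hidden_states=prompt_embeds,
            controlnet_cond=cond_in, return_dict=False)
        noise_pred = unet(
            latent_in, t_in, encoder_hidden_states=prompt_embeds,
            down_block_additional_residuals=down_res,
            mid_block_additional_residual=mid_res).sample
        uncond, cond = noise_pred.chunk(2)
        noise_pred = uncond + guidance_scale * (cond - uncond)

    # Standard SDS weighting w(t) = 1 - alpha_bar_t; gradient is w(t) * (eps_hat - eps),
    # applied through an MSE surrogate so autograd backpropagates it to the renderer.
    w = (1.0 - alphas_cumprod[t]).view(-1, 1, 1, 1)
    grad = w * (noise_pred - noise)
    target = (latents - grad).detach()
    return 0.5 * F.mse_loss(latents, target, reduction="sum")
```

In this sketch the DensePose map rendered from the same camera as the avatar ties each 2D supervision signal to the 3D pose, which is the mechanism the abstract credits with mitigating the Janus problem.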
