Driving-signal aware full-body avatars

We present a learning-based method for building driving-signal aware full-body avatars. Our model is a conditional variational autoencoder that can be animated with incomplete driving signals, such as human pose and facial keypoints, and produces a high-quality representation of human geometry and view-dependent appearance. The core intuition behind our method is that better drivability and generalization can be achieved by disentangling the driving signals from the remaining generative factors, which are not available during animation. To this end, we explicitly account for information deficiency in the driving signal by introducing a latent space that exclusively captures the remaining information. This enables imputation of the missing factors required during full-body animation while remaining faithful to the driving signal. We also propose a learnable localized compression of the driving signal, which promotes better generalization and helps minimize the influence of global chance correlations often found in real datasets. For a given driving signal, the resulting variational model produces a compact space of uncertainty for the missing factors, which allows the choice of an imputation strategy best suited to a particular application. We demonstrate the efficacy of our approach on the challenging problem of full-body animation for virtual telepresence, with driving signals acquired from minimal sensors placed in the environment and mounted on a VR headset.
