Textured Neural Avatars

We present a system for learning full-body neural avatars, i.e. deep networks that produce full-body renderings of a person for varying body pose and varying camera pose. Our system takes the middle path between the classical graphics pipeline and the recent deep learning approaches that generate images of humans using image-to-image translation. In particular, our system estimates an explicit two-dimensional texture map of the model surface. At the same time, it abstains from explicit shape modeling in 3D. Instead, at test time, the system uses a fully-convolutional network to directly map the configuration of body feature points with respect to the camera to the 2D texture coordinates of individual pixels in the image frame. We show that such a system is capable of learning to generate realistic renderings while being trained on videos annotated with 3D poses and foreground masks. We also demonstrate that maintaining an explicit texture representation helps our system achieve better generalization compared to systems that use direct image-to-image translation.
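
To make the rendering step concrete, the following is a minimal sketch of how an image could be formed once per-pixel texture coordinates are available. It is an illustration under stated assumptions, not the system's actual implementation: the function name `sample_texture`, the single-texture layout, the normalized `[0, 1]` coordinate convention, the bilinear lookup, and the use of a per-pixel foreground mask at render time are all hypothetical choices made here. In the described system the coordinates would come from the fully-convolutional network; in the sketch they are simply given as an array.

```python
import numpy as np

def sample_texture(texture, uv, mask):
    """Render an image by looking up a texture at per-pixel coordinates.

    texture: (H_t, W_t, 3) array, the explicit 2D texture map.
    uv:      (H, W, 2) array of texture coordinates in [0, 1]
             (assumed convention: uv[..., 0] is x, uv[..., 1] is y).
    mask:    (H, W) array in [0, 1], hypothetical foreground weight.
    """
    h_t, w_t, _ = texture.shape
    # Map normalized coordinates to continuous texel positions.
    x = uv[..., 0] * (w_t - 1)
    y = uv[..., 1] * (h_t - 1)
    # Integer corners of the texel cell containing each sample.
    x0 = np.clip(np.floor(x).astype(int), 0, w_t - 1)
    y0 = np.clip(np.floor(y).astype(int), 0, h_t - 1)
    x1 = np.clip(x0 + 1, 0, w_t - 1)
    y1 = np.clip(y0 + 1, 0, h_t - 1)
    # Fractional offsets inside the cell, broadcast over color channels.
    fx = (x - x0)[..., None]
    fy = (y - y0)[..., None]
    # Bilinear interpolation: blend along x, then along y.
    top = texture[y0, x0] * (1 - fx) + texture[y0, x1] * fx
    bottom = texture[y1, x0] * (1 - fx) + texture[y1, x1] * fx
    color = top * (1 - fy) + bottom * fy
    # Zero out background pixels.
    return color * mask[..., None]
```

Because the lookup is linear in the texture values, a sketch like this stays differentiable with respect to the texture, which is one reason an explicit texture can be estimated jointly with a coordinate-predicting network.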
