DGPose: Deep Generative Models for Human Body Analysis

Deep generative modelling for human body analysis is an emerging problem with many interesting applications. However, the latent space learned by such approaches is typically not interpretable, resulting in less flexibility. In this work, we present deep generative models for human body analysis in which the body pose and the visual appearance are disentangled. Such a disentanglement allows independent manipulation of pose and appearance, and hence enables applications such as pose-transfer without specific training for such a task. Our proposed models, the Conditional-DGPose and the Semi-DGPose, have different characteristics. In the first, body pose labels are taken as conditioners, from a fully-supervised training set. In the second, our structured semi-supervised approach allows for pose estimation to be performed by the model itself and relaxes the need for labelled data. Therefore, the Semi-DGPose aims for the joint understanding and generation of people in images. It is not only capable of mapping images to interpretable latent representations but also able to map these representations back to the image space. We compare our models with relevant baselines, the ClothNet-Body and the Pose Guided Person Generation networks, demonstrating their merits on the Human3.6M, ChictopiaPlus and DeepFashion benchmarks.

[1]  Luc Van Gool,et al.  Disentangled Person Image Generation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2]  Robert T. Collins,et al.  A Generative Model for Simultaneous Estimation of Human Body Shape and Pixel-Level Segmentation , 2012, ECCV.

[3]  Luc Van Gool,et al.  Learning Generative Models for Monocular Body Pose Estimation , 2007, ACCV.

[4]  Luc Van Gool,et al.  Pose Guided Person Image Generation , 2017, NIPS.

[5]  Mark Everingham,et al.  Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation , 2010, BMVC.

[6]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[7]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[8]  Steven M. Seitz,et al.  The Visual Turing Test for Scene Reconstruction , 2013, 2013 International Conference on 3D Vision.

[9]  Karl F. MacDorman,et al.  Too real for comfort? Uncanny responses to computer generated faces , 2009, Comput. Hum. Behav..

[10]  Dariu Gavrila,et al.  A mixed generative-discriminative framework for pedestrian classification , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Ole Winther,et al.  Autoencoding beyond pixels using a learned similarity metric , 2015, ICML.

[12]  Peter V. Gehler,et al.  A Generative Model of People in Clothing , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[13]  Takeo Kanade,et al.  Virtualized Reality: Constructing Virtual Worlds from Real Scenes , 1997, IEEE Multim..

[14]  John P. Lewis,et al.  Universal capture: image-based facial animation for "The Matrix Reloaded" , 2003, SIGGRAPH '03.

[15]  Bernard Ghanem,et al.  Sim4CV: A Photo-Realistic Simulator for Computer Vision Applications , 2017, International Journal of Computer Vision.

[16]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Philip H. S. Torr,et al.  A Conditional Deep Generative Model of People in Natural Images , 2019, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[19]  Dimitrios Tzionas,et al.  Embodied hands , 2017, ACM Trans. Graph..

[20]  Alexei A. Efros,et al.  Everybody Dance Now , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Jan Kautz,et al.  Video-to-Video Synthesis , 2018, NeurIPS.

[22]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[23]  Nassir Navab,et al.  Patient MoCap: Human Pose Estimation Under Blanket Occlusion for Hospital Monitoring Applications , 2016, MICCAI.

[24]  Daniel Thalmann,et al.  Handbook of Virtual Humans , 2004 .

[25]  Nando de Freitas,et al.  Robust Imitation of Diverse Behaviors , 2017, NIPS.

[26]  Ahmed M. Elgammal,et al.  High Resolution Acquisition, Learning and Transfer of Dynamic 3‐D Facial Expressions , 2004, Comput. Graph. Forum.

[27]  Michael J. Black,et al.  FAUST: Dataset and Evaluation for 3D Mesh Registration , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Cordelia Schmid,et al.  Image-Based Synthesis for Deep 3D Human Pose Estimation , 2018, International Journal of Computer Vision.

[29]  J. Gallant,et al.  Identifying natural images from human brain activity , 2008, Nature.

[30]  Mark Pauly,et al.  Dynamic 3D avatar creation from hand-held video input , 2015, ACM Trans. Graph..

[31]  Xiaogang Wang,et al.  DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Jian Dong,et al.  Deep Human Parsing with Active Template Regression , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Takeo Kanade,et al.  Synthesizing a Scene-Specific Pedestrian Detector and Pose Estimator for Static Video Surveillance , 2018, International Journal of Computer Vision.

[35]  Martial Hebert,et al.  The Pose Knows: Video Forecasting by Generating Pose Futures , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[36]  Justus Thies,et al.  Face2Face: real-time face capture and reenactment of RGB videos , 2019, Commun. ACM.

[37]  David Salesin,et al.  Synthesizing realistic facial expressions from photographs , 1998, SIGGRAPH.

[38]  Sehoon Ha,et al.  Iterative Training of Dynamic Skills Inspired by Human Coaching Techniques , 2014, ACM Trans. Graph..

[39]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Jan Kautz,et al.  MoCoGAN: Decomposing Motion and Content for Video Generation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[41]  A. Yuille,et al.  Opinion TRENDS in Cognitive Sciences Vol.10 No.7 July 2006 Special Issue: Probabilistic models of cognition Vision as Bayesian inference: analysis by synthesis? , 2022 .

[42]  Michael J. Black,et al.  SMPL: A Skinned Multi-Person Linear Model , 2023 .

[43]  Pascal Fua,et al.  Multicamera People Tracking with a Probabilistic Occupancy Map , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Yuting Zhang,et al.  Unsupervised Discovery of Object Landmarks as Structural Representations , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45]  Bernt Schiele,et al.  DeeperCut: A Deeper, Stronger, and Faster Multi-person Pose Estimation Model , 2016, ECCV.

[46]  Daan Wierstra,et al.  Stochastic Backpropagation and Approximate Inference in Deep Generative Models , 2014, ICML.

[47]  Edmond Boyer,et al.  Fusion of multiview silhouette cues using a space occupancy grid , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[48]  Nicu Sebe,et al.  Deformable GANs for Pose-Based Human Image Generation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[49]  Vincent Lepetit,et al.  Bridging the Gap between Detection and Tracking for 3D Monocular Video-Based Motion Capture , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[50]  Eero P. Simoncelli,et al.  Natural image statistics and neural representation. , 2001, Annual review of neuroscience.

[51]  Matthias Bethge,et al.  A note on the evaluation of generative models , 2015, ICLR.

[52]  Stan Sclaroff,et al.  3D hand pose reconstruction using specialized mappings , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[53]  Jaakko Lehtinen,et al.  Progressive Growing of GANs for Improved Quality, Stability, and Variation , 2017, ICLR.

[54]  Eero P. Simoncelli,et al.  Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.

[55]  Bodo Rosenhahn,et al.  Sparse Inertial Poser: Automatic 3D Human Pose Estimation from Sparse IMUs , 2017, Comput. Graph. Forum.

[56]  Cristian Sminchisescu,et al.  Human Appearance Transfer , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[57]  Cordelia Schmid,et al.  Learning from Synthetic Humans , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Adrian Hilton,et al.  Video-based character animation , 2005, SCA '05.

[59]  Philip H. S. Torr,et al.  A Semi-supervised Deep Generative Model for Human Body Analysis , 2018, ECCV Workshops.

[60]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  C. Umilta,et al.  Face preference at birth. , 1996, Journal of experimental psychology. Human perception and performance.

[62]  Andrew W. Fitzgibbon,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR 2011.

[63]  Daniel Thalmann,et al.  Virtual humans: thirty years of research, what next? , 2005, The Visual Computer.

[64]  Takeo Kanade,et al.  Learning scene-specific pedestrian detectors without real data , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[65]  Fetter,et al.  A Progression of Human Figures Simulated by Computer Graphics , 1982, IEEE Computer Graphics and Applications.

[66]  Tony Ezzat,et al.  Facial analysis and synthesis using image-based models , 1996, Proceedings of the Second International Conference on Automatic Face and Gesture Recognition.

[67]  Ahmed M. Elgammal,et al.  Inferring 3D body pose from silhouettes using activity manifold learning , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[68]  Jonathan S. Herberg,et al.  Image Visual Realism: From Human Perception to Machine Computation , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[69]  W. Geisler Visual perception and the statistical properties of natural scenes. , 2008, Annual review of psychology.

[70]  Rainer Stiefelhagen,et al.  Head pose estimation using stereo vision for human-robot interaction , 2004, Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004. Proceedings..

[71]  Georgios Tzimiropoulos,et al.  Human Pose Estimation via Convolutional Part Heatmap Regression , 2016, ECCV.

[72]  Adrian Hilton,et al.  Surface Capture for Performance-Based Animation , 2007, IEEE Computer Graphics and Applications.

[73]  Rómer Rosales,et al.  3D Hand Pose Reconstruction Using Specialized Mappings , 2001, ICCV.

[74]  Peter V. Gehler,et al.  DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[75]  Honglak Lee,et al.  Learning Structured Output Representation using Deep Conditional Generative Models , 2015, NIPS.

[76]  Pascal Fua,et al.  Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation , 2018, ECCV.

[77]  Cordelia Schmid,et al.  Towards Understanding Action Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[78]  Andrew Zisserman,et al.  Recurrent Human Pose Estimation , 2016, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[79]  Navdeep Jaitly,et al.  Chained Predictions Using Convolutional Neural Networks , 2016, ECCV.

[80]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[81]  Geoffrey E. Hinton,et al.  Factored 3-Way Restricted Boltzmann Machines For Modeling Natural Images , 2010, AISTATS.

[82]  Xiaogang Wang,et al.  Learning Feature Pyramids for Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[83]  Philip H. S. Torr,et al.  FLIPDIAL: A Generative Model for Two-Way Visual Dialogue , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[84]  Jonathan Tompson,et al.  Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation , 2014, NIPS.

[85]  Justus Thies,et al.  Demo of Face2Face: real-time face capture and reenactment of RGB videos , 2016, SIGGRAPH Emerging Technologies.

[86]  Iasonas Kokkinos,et al.  Dense Pose Transfer , 2018, ECCV.

[87]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[88]  Wei Sun,et al.  Virtual people: capturing human models to populate virtual worlds , 1999, Proceedings Computer Animation 1999.

[89]  Pieter Abbeel,et al.  Gradient Estimation Using Stochastic Computation Graphs , 2015, NIPS.

[90]  Max Welling,et al.  Semi-supervised Learning with Deep Generative Models , 2014, NIPS.

[91]  Andrew Zisserman,et al.  Flowing ConvNets for Human Pose Estimation in Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[92]  Frédo Durand,et al.  Synthesizing Images of Humans in Unseen Poses , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[93]  BlakeAndrew,et al.  Real-time human pose recognition in parts from single depth images , 2013 .

[94]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[95]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[96]  Björn Ommer,et al.  A Variational U-Net for Conditional Appearance and Shape Generation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[97]  Frank D. Wood,et al.  Learning Disentangled Representations with Semi-Supervised Deep Generative Models , 2017, NIPS.

[98]  Cordelia Schmid,et al.  MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild , 2016, NIPS.

[99]  Matthew Turk,et al.  A Morphable Model For The Synthesis Of 3D Faces , 1999, SIGGRAPH.

[100]  Rómer Rosales,et al.  Combining Generative and Discriminative Models in a Framework for Articulated Pose Estimation , 2006, International Journal of Computer Vision.

[101]  Ahmed M. Elgammal,et al.  Facial Expression Analysis Using Nonlinear Decomposable Generative Models , 2005, AMFG.

[102]  Yi Yang,et al.  Articulated pose estimation with flexible mixtures-of-parts , 2011, CVPR 2011.

[103]  Alan L. Yuille,et al.  Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations , 2014, NIPS.

[104]  Joshua B. Tenenbaum,et al.  Deep Convolutional Inverse Graphics Network , 2015, NIPS.

[105]  PaulyMark,et al.  Dynamic 3D avatar creation from hand-held video input , 2015 .

[106]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[107]  Philip H. S. Torr,et al.  Deep Fully-Connected Part-Based Models for Human Pose Estimation , 2018, ACML.

[108]  Francesc Moreno-Noguer,et al.  Unsupervised Person Image Synthesis in Arbitrary Poses , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[109]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[110]  Adrian Hilton,et al.  Deep Autoencoder for Combined Human Pose Estimation and body Model Upscaling , 2018, ECCV.

[111]  Xiaogang Wang,et al.  Multi-context Attention for Human Pose Estimation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).