Paper Information - Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation

Modern 3D human pose estimation techniques rely on deep networks, which require large amounts of training data. Weakly-supervised methods reduce the need for supervision by using 2D poses or multi-view imagery without annotations, but they still need a sufficiently large set of samples with 3D annotations for learning to succeed. In this paper, we propose to overcome this problem by learning a geometry-aware body representation from multi-view images without annotations. To this end, we use an encoder-decoder that predicts an image from one viewpoint given an image from another viewpoint. Because this representation encodes 3D geometry, using it in a semi-supervised setting makes it easier to learn a mapping from it to 3D human pose. In our experiments, our approach significantly outperforms fully-supervised methods given the same amount of labeled data, and improves over other semi-supervised methods while using as little as 1% of the labeled data.
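The abstract does not say how the encoder-decoder moves between viewpoints. One plausible reading, sketched below purely as an illustration, is that the latent code behaves like a set of 3D points that can be rotated from the source camera frame into the target camera frame before decoding. The point-set latent, the known camera rotations, and all function names (`rot_y`, `relative_rotation`, `rotate_latent`) are assumptions made for this sketch and are not taken from the text above; the encoder and decoder networks themselves are omitted.

```python
import math


def rot_y(angle):
    """World-to-camera rotation about the vertical (y) axis, as nested lists."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def transpose(m):
    """Transpose of a matrix given as nested lists."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def relative_rotation(r_src, r_tgt):
    """Rotation taking source-camera coordinates to target-camera coordinates."""
    return matmul(r_tgt, transpose(r_src))


def rotate_latent(latent_points, rotation):
    """Apply a 3x3 rotation to every 3D point of the (hypothetical) latent code."""
    return [[sum(rotation[i][k] * p[k] for k in range(3)) for i in range(3)]
            for p in latent_points]


# Toy stand-in for an encoder output: a few 3D points in world coordinates,
# observed by two cameras with known (assumed) rotations.
world_points = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
r_source, r_target = rot_y(0.3), rot_y(1.1)

latent_source = rotate_latent(world_points, r_source)
latent_target = rotate_latent(latent_source,
                              relative_rotation(r_source, r_target))
# A decoder would now render the target-view image from latent_target.
```

Under these assumptions, `latent_target` matches what encoding the same points directly in the target camera frame would give, which is the sense in which such a representation could be called geometry-aware.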
