3D Human Pose, Shape and Texture From Low-Resolution Images and Videos

3D human pose and shape estimation from monocular images has been an active research area in computer vision. Existing deep learning methods for this task rely on high-resolution input, which however, is not always available in many scenarios such as video surveillance and sports broadcasting. Two common approaches to deal with low-resolution images are applying super-resolution techniques to the input, which may result in unpleasant artifacts, or simply training one model for each resolution, which is impractical in many realistic applications. To address the above issues, this paper proposes a novel algorithm called RSC-Net, which consists of a Resolution-aware network, a Self-supervision loss, and a Contrastive learning scheme. The proposed method is able to learn 3D body pose and shape across different resolutions with one single model. The self-supervision loss enforces scale-consistency of the output, and the contrastive learning scheme enforces scale-consistency of the deep features. We show that both these new losses provide robustness when learning in a weakly-supervised manner. Moreover, we extend the RSC-Net to handle low-resolution videos and apply it to reconstruct textured 3D pedestrians from low-resolution input. Extensive experiments demonstrate that the RSC-Net can achieve consistently better results than the state-of-the-art methods for challenging low-resolution images.

[1]  Francesc Moreno-Noguer,et al.  3D Human Shape and Pose from a Single Low-Resolution Image with Self-Supervised Learning , 2020, ECCV.

[2]  Michael J. Black,et al.  OpenDR: An Approximate Differentiable Renderer , 2014, ECCV.

[3]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[4]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[5]  Marcus A. Magnor,et al.  Tex2Shape: Detailed Full Human Body Geometry From a Single Image , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[6]  Larry S. Davis,et al.  AVSS 2011 demo session: A large-scale benchmark dataset for event recognition in surveillance video , 2011, AVSS.

[7]  Xiangyu Xu,et al.  Learning Factorized Weight Matrix for Joint Filtering , 2020, ICML.

[8]  Gregory Shakhnarovich,et al.  Task-Driven Super Resolution: Object Detection in Low-resolution Images , 2018, ICONIP.

[9]  Michael J. Black,et al.  SMPL: A Skinned Multi-Person Linear Model , 2023 .

[10]  Geoffrey E. Hinton,et al.  A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.

[11]  Cordelia Schmid,et al.  Learning from Synthetic Humans , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Jian Wang,et al.  Re-Identification Supervised Texture Generation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Jitendra Malik,et al.  Learning 3D Human Dynamics From Video , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Cristian Sminchisescu,et al.  Monocular 3D Pose and Shape Estimation of Multiple People in Natural Scenes: The Importance of Multiple Scene Constraints , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[15]  Raymond Y. K. Lau,et al.  Least Squares Generative Adversarial Networks , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[16]  Mark Everingham,et al.  Learning effective human pose estimation from inaccurate annotation , 2011, CVPR 2011.

[17]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[18]  Timo Aila,et al.  Temporal Ensembling for Semi-Supervised Learning , 2016, ICLR.

[19]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[20]  Hiroshi Murase,et al.  Exemplar-based human body super-resolution for surveillance camera systems , 2014, 2014 International Conference on Computer Vision Theory and Applications (VISAPP).

[21]  Shiming Ge,et al.  Low-Resolution Face Recognition in the Wild via Selective Knowledge Distillation , 2018, IEEE Transactions on Image Processing.

[22]  Hao Li,et al.  PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Yunchao Wei,et al.  Perceptual Generative Adversarial Networks for Small Object Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Qi Tian,et al.  Beyond Part Models: Person Retrieval with Refined Part Pooling , 2017, ECCV.

[25]  Michael J. Black,et al.  Learning to Reconstruct 3D Human Pose and Shape via Model-Fitting in the Loop , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[26]  Nikolaus F. Troje,et al.  AMASS: Archive of Motion Capture As Surface Shapes , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[27]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[28]  Marcus A. Magnor,et al.  Learning to Reconstruct People in Clothing From a Single RGB Camera , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Léon Bottou,et al.  Wasserstein Generative Adversarial Networks , 2017, ICML.

[30]  Jason Weston,et al.  Curriculum learning , 2009, ICML '09.

[31]  Mark Everingham,et al.  Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation , 2010, BMVC.

[32]  Weimin Tan,et al.  Feature Super-Resolution: Make Machine See More Clearly , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Wenxiu Sun,et al.  Towards Real Scene Super-Resolution With Raw Images , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Yun Fu,et al.  Residual Dense Network for Image Super-Resolution , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36]  Michael J. Black,et al.  VIBE: Video Inference for Human Body Pose and Shape Estimation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Tao Yu,et al.  DeepHuman: 3D Human Reconstruction From a Single Image , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[38]  Francesc Moreno-Noguer,et al.  3D Human Pose Tracking Priors using Geodesic Mixture Models , 2017, International Journal of Computer Vision.

[39]  Bernhard P. Wrobel,et al.  Multiple View Geometry in Computer Vision , 2001 .

[40]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[41]  Jian Sun,et al.  Identity Mappings in Deep Residual Networks , 2016, ECCV.

[42]  Andrew Zisserman,et al.  Sim2real transfer learning for 3D human pose estimation: motion to the rescue , 2019, NeurIPS.

[43]  Francesc Moreno-Noguer,et al.  3D Human Pose Estimation from a Single Image via Distance Matrix Regression , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Pascal Fua,et al.  Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision , 2016, 2017 International Conference on 3D Vision (3DV).

[45]  Bodo Rosenhahn,et al.  Supplementary Material to: Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera , 2018 .

[46]  Phillip Isola,et al.  Contrastive Multiview Coding , 2019, ECCV.

[47]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[48]  Jitendra Malik,et al.  Predicting 3D Human Dynamics From Video , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[49]  Kaiming He,et al.  Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Alexei A. Efros,et al.  The Unreasonable Effectiveness of Deep Features as a Perceptual Metric , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[51]  Peter V. Gehler,et al.  Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image , 2016, ECCV.

[52]  Marcus A. Magnor,et al.  Video Based Reconstruction of 3D People Models , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[53]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54]  Thomas S. Huang,et al.  Studying Very Low Resolution Recognition Using Deep Networks , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Deqing Sun,et al.  Learning to Super-Resolve Blurry Face and Text Images , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[56]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[58]  Francesc Moreno-Noguer,et al.  3DPeople: Modeling the Geometry of Dressed Humans , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[59]  Francesc Moreno-Noguer,et al.  A Joint Model for 2D and 3D Pose Estimation from a Single Image , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[60]  Xiaowei Zhou,et al.  Learning to Estimate 3D Human Pose and Shape from a Single Color Image , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[61]  Harri Valpola,et al.  Weight-averaged consistency targets improve semi-supervised deep learning results , 2017, ArXiv.

[62]  Shaogang Gong,et al.  Low-Resolution Face Recognition , 2018, ACCV.

[63]  Wonhee Lee,et al.  Better to Follow, Follow to Be Better: Towards Precise Supervision of Feature Super-Resolution for Small Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[64]  Peter V. Gehler,et al.  Neural Body Fitting: Unifying Deep Learning and Model Based Human Pose and Shape Estimation , 2018, 2018 International Conference on 3D Vision (3DV).

[65]  Garrison W. Cottrell,et al.  ReZero is All You Need: Fast Convergence at Large Depth , 2020, UAI.

[66]  Qi Tian,et al.  Scalable Person Re-identification: A Benchmark , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[67]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[68]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[69]  Hao Li,et al.  SiCloPe: Silhouette-Based Clothed People , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[70]  Andrea Vedaldi,et al.  Tiny People Pose , 2018, ACCV.