Unsupervised Cross-Dataset Adaptation via Probabilistic Amodal 3D Human Pose Completion

Despite remarkable success of supervised deep learning models for 3D human pose estimation, performance of such models is mostly limited to constrained laboratory settings. Such models not only exhibit an alarming level of dataset bias, but also fail to operate on unconstrained videos in the presence of external variations such as camera motion, partial body visibility, occlusion, etc. Acknowledging these shortcomings, firstly, we aim to formalize a motion representation learning framework by effectively utilizing both constrained and artificially generated unconstrained video samples for datasets with 3D pose annotation. Without ignoring the inherent uncertainty in pose estimation for the truncated video frames, we devise a novel probabilistic amodal pose completion framework to enable generation of multiple plausible pose-filling outcomes. Secondly, to address dataset bias, the probabilistic amodal framework is reutilized to design novel self-supervised objectives. This not only enables adaptation of the model to target unannotated datasets (wild YouTube videos) but also encourages learning of generic motion representations beyond the available supervised data even in unconstrained scenarios. Such a training regime helps us achieve state-of-the art performance on unsupervised cross-dataset pose estimation, with a significant improvement in partially-visible unconstrained scenarios.

[1]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Yun Fu,et al.  Videography-Based Unconstrained Video Analysis , 2017, IEEE Transactions on Image Processing.

[3]  Vincent Lepetit,et al.  Learning Latent Representations of 3D Human Pose with Deep Neural Networks , 2018, International Journal of Computer Vision.

[4]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Trevor Darrell,et al.  Learning Features by Watching Objects Move , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Mohan S. Kankanhalli,et al.  Marker-Less 3D Human Motion Capture with Monocular Image Sequence and Height-Maps , 2016, ECCV.

[7]  John J. Foxe,et al.  Setting Boundaries: Brain Dynamics of Modal and Amodal Illusory Shape Completion in Humans , 2004, The Journal of Neuroscience.

[8]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[9]  R. Venkatesh Babu,et al.  GAN-Tree: An Incrementally Learned Hierarchical Generative Framework for Multi-Modal Data Distributions , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[10]  Emre Akbas,et al.  Self-Supervised Learning of 3D Human Pose Using Multi-View Geometry , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  R. Venkatesh Babu,et al.  AdaDepth: Unsupervised Content Congruent Adaptation for Depth Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[12]  Ali Farhadi,et al.  SeGAN: Segmenting and Generating the Invisible , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[13]  Jitendra Malik,et al.  Amodal Completion and Size Constancy in Natural Scenes , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[14]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[15]  Nitish Srivastava Unsupervised Learning of Visual Representations using Videos , 2015 .

[16]  Cordelia Schmid,et al.  LCR-Net++: Multi-Person 2D and 3D Pose Detection in Natural Images , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Andrea Vedaldi,et al.  Self-Supervised Learning of Geometrically Stable Features Through Probabilistic Introspection , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18]  Paolo Favaro,et al.  Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles , 2016, ECCV.

[19]  Maria A. Amer,et al.  Deep 3D Human Pose Estimation Under Partial Body Presence , 2018, 2018 25th IEEE International Conference on Image Processing (ICIP).

[20]  Dario Pavllo,et al.  3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  R. Venkatesh Babu,et al.  UM-Adapt: Unsupervised Multi-Task Adaptation Using Adversarial Cross-Task Distillation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Geoffrey E. Hinton,et al.  Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[23]  Alexei A. Efros,et al.  Context Encoders: Feature Learning by Inpainting , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Antoni B. Chan,et al.  Martial Arts, Dancing and Sports dataset: A challenging stereo and multi-view dataset for 3D human pose estimation , 2017, Image Vis. Comput..

[26]  Xiaowei Zhou,et al.  Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Hui Cheng,et al.  Recurrent 3D Pose Sequence Machines , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[29]  Xiaowei Zhou,et al.  Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  R. Venkatesh Babu,et al.  Unsupervised Feature Learning of Human Actions As Trajectories in Pose Embedding Manifold , 2018, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[32]  Gregory Shakhnarovich,et al.  Colorization as a Proxy Task for Visual Understanding , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Yuandong Tian,et al.  Semantic Amodal Segmentation , 2015, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Kristen Grauman,et al.  Learning Image Representations Tied to Ego-Motion , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[36]  Navdeep Jaitly,et al.  Adversarial Autoencoders , 2015, ArXiv.

[37]  Jitendra Malik,et al.  Amodal Instance Segmentation , 2016, ECCV.

[38]  Andrew Owens,et al.  Ambient Sound Provides Supervision for Visual Learning , 2016, ECCV.

[39]  Alexei A. Efros,et al.  Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Cordelia Schmid,et al.  LCR-Net: Localization-Classification-Regression for Human Pose , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  David W. Jacobs,et al.  WarpNet: Weakly Supervised Matching for Single-View Reconstruction , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Hans-Peter Seidel,et al.  VNect , 2017, ACM Trans. Graph..

[43]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[44]  Francesc Moreno-Noguer,et al.  3D Human Pose Estimation from a Single Image via Distance Matrix Regression , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Pascal Fua,et al.  Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision , 2016, 2017 International Conference on 3D Vision (3DV).

[46]  R. Venkatesh Babu,et al.  BiHMP-GAN: Bidirectional 3D Human Motion Prediction GAN , 2018, AAAI.

[47]  Antoni B. Chan,et al.  Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[48]  Andrea Vedaldi,et al.  Unsupervised learning of object frames by dense equivariant image labelling , 2017, NIPS.

[49]  Martial Hebert,et al.  Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification , 2016, ECCV.

[50]  Aaron C. Courville,et al.  Improved Training of Wasserstein GANs , 2017, NIPS.

[51]  Josef Sivic,et al.  Convolutional Neural Network Architecture for Geometric Matching , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52]  Yichen Wei,et al.  Integral Human Pose Regression , 2017, ECCV.

[53]  Michael J. Black,et al.  Pose-conditioned joint angle limits for 3D human pose reconstruction , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[55]  Cordelia Schmid,et al.  Learning from Synthetic Humans , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[57]  Yichen Wei,et al.  Compositional Human Pose Regression , 2018, Comput. Vis. Image Underst..

[58]  Alexei A. Efros,et al.  Toward Multimodal Image-to-Image Translation , 2017, NIPS.

[59]  Charles M Terry,et al.  The martial arts. , 2006, Physical medicine and rehabilitation clinics of North America.