Distill Knowledge From NRSfM for Weakly Supervised 3D Pose Learning

We propose to learn a 3D pose estimator by distilling knowledge from Non-Rigid Structure from Motion (NRSfM). Our method uses solely 2D landmark annotations. No 3D data, multi-view/temporal footage, or object specific prior is required. This alleviates the data bottleneck, which is one of the major concern for supervised methods. The challenge for using NRSfM as teacher is that they often make poor depth reconstruction when the 2D projections have strong ambiguity. Directly using those wrong depth as hard target would negatively impact the student. Instead, we propose a novel loss that ties depth prediction to the cost function used in NRSfM. This gives the student pose estimator freedom to reduce depth error by associating with image features. Validated on H3.6M dataset, our learned 3D pose estimation network achieves more accurate reconstruction compared to NRSfM methods. It also outperforms other weakly supervised methods, in spite of using significantly less supervision.

[1]  Takeo Kanade,et al.  Trajectory Space: A Dual Representation for Nonrigid Structure from Motion , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Hongdong Li,et al.  A simple prior-free method for non-rigid structure-from-motion factorization , 2012, CVPR.

[3]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Marc Teboulle,et al.  A fast Iterative Shrinkage-Thresholding Algorithm with application to wavelet-based image deblurring , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[5]  James J. Little,et al.  A Simple Yet Effective Baseline for 3d Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[6]  Simon Lucey,et al.  Deep Interpretable Non-Rigid Structure from Motion , 2019, ArXiv.

[7]  Andrea Vedaldi,et al.  Deep Image Prior , 2017, International Journal of Computer Vision.

[8]  Anoop Cherian,et al.  Scalable Dense Non-rigid Structure-from-Motion: A Grassmannian Perspective , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[9]  Oisin Mac Aodha,et al.  Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Xiaogang Wang,et al.  3D Human Pose Estimation in the Wild by Adversarial Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[11]  Xiaowei Zhou,et al.  Harvesting Multiple Views for Marker-Less 3D Human Pose Annotations , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Michael Elad,et al.  Convolutional Neural Networks Analyzed via Convolutional Sparse Coding , 2016, J. Mach. Learn. Res..

[13]  Pascal Fua,et al.  Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation , 2018, ECCV.

[14]  Ersin Yumer,et al.  Self-supervised Learning of Motion Capture , 2017, NIPS.

[15]  Chen Kong,et al.  Prior-Less Compressible Structure from Motion , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  T. Kanade,et al.  Reconstructing 3D Human Pose from 2D Image Landmarks , 2012, ECCV.

[17]  Simon Lucey,et al.  Complex Non-rigid Motion 3D Reconstruction by Union of Subspaces , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Takeo Kanade,et al.  Panoptic Studio: A Massively Multiview System for Social Motion Capture , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[19]  Aleix M. Martínez,et al.  Kernel non-rigid structure from motion , 2011, 2011 International Conference on Computer Vision.

[20]  Yichen Wei,et al.  Integral Human Pose Regression , 2017, ECCV.

[21]  Cordelia Schmid,et al.  Learning from Synthetic Humans , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Bodo Rosenhahn,et al.  RepNet: Weakly Supervised Training of an Adversarial Reprojection Network for 3D Human Pose Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Songhwai Oh,et al.  Consensus of Non-rigid Reconstructions , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Chen Kong,et al.  Structure from Category: A Generic and Prior-Less Approach , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[25]  Yuandong Tian,et al.  Single Image 3D Interpreter Network , 2016, ECCV.

[26]  Katerina Fragkiadaki,et al.  Adversarial Inverse Graphics Networks: Learning 2D-to-3D Lifting and Image-to-Image Translation from Unpaired Supervision , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[27]  Lourdes Agapito,et al.  Lifting from the Deep: Convolutional 3D Pose Estimation from a Single Image , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Zhenhua Wang,et al.  Synthesizing Training Images for Boosting Human 3D Pose Estimation , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[29]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Aleix M. Martínez,et al.  Learning Spatially-Smooth Mappings in Non-Rigid Structure From Motion , 2012, ECCV.

[31]  Cristian Sminchisescu,et al.  Monocular 3D Pose and Shape Estimation of Multiple People in Natural Scenes: The Importance of Multiple Scene Constraints , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Aleix M. Martínez,et al.  Computing Smooth Time Trajectories for Camera and Deformable Shape in Structure from Motion with Occlusion , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Alessio Del Bue,et al.  Non-rigid structure from motion using ranklet-based tracking and non-linear optimization , 2007, Image Vis. Comput..

[34]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Deva Ramanan,et al.  3D Human Pose Estimation = 2D Pose Estimation + Matching , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Pascal Fua,et al.  Learning Monocular 3D Human Pose Estimation from Multi-view Images , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[37]  I. Daubechies,et al.  An iterative thresholding algorithm for linear inverse problems with a sparsity constraint , 2003, math/0307152.

[38]  Xiaowei Zhou,et al.  Ordinal Depth Supervision for 3D Human Pose Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39]  Henning Biermann,et al.  Recovering non-rigid 3D shape from image streams , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[40]  F U GotardoPaulo,et al.  Computing Smooth Time Trajectories for Camera and Deformable Shape in Structure from Motion with Occlusion , 2011 .

[41]  Abhishek Sharma,et al.  Learning 3D Human Pose from Structure and Motion , 2017, ECCV.

[42]  Richard G. Baraniuk,et al.  Sparse Coding via Thresholding and Local Competition in Neural Circuits , 2008, Neural Computation.

[43]  Ruigang Yang,et al.  ApolloCar3D: A Large 3D Car Instance Understanding Benchmark for Autonomous Driving , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Hongdong Li,et al.  Multi-Body Non-Rigid Structure-from-Motion , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[45]  Simon Lucey,et al.  Learning Depth from Monocular Videos Using Direct Methods , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[46]  Ambrish Tyagi,et al.  Can 3D Pose be Learned from 2D Projections Alone? , 2018, ECCV Workshops.

[47]  Yichen Wei,et al.  Towards 3D Human Pose Estimation in the Wild: A Weakly-Supervised Approach , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[48]  Xiaowei Zhou,et al.  Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Francesc Moreno-Noguer,et al.  Image Collection Pop-up: 3D Reconstruction and Clustering of Rigid and Non-rigid Categories , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.