Sim2real transfer learning for 3D human pose estimation: motion to the rescue

Synthetic visual data can provide practicically infinite diversity and rich labels, while avoiding ethical issues with privacy and bias. However, for many tasks, current models trained on synthetic data generalize poorly to real data. The task of 3D human pose estimation is a particularly interesting example of this sim2real problem, because learning-based approaches perform reasonably well given real training data, yet labeled 3D poses are extremely difficult to obtain in the wild, limiting scalability. In this paper, we show that standard neural-network approaches, which perform poorly when trained on synthetic RGB images, can perform well when the data is pre-processed to extract cues about the person’s motion, notably as optical flow and the motion of 2D keypoints. Therefore, our results suggest that motion can be a simple way to bridge a sim2real gap when video is available. We evaluate on the 3D Poses in the Wild dataset, the most challenging modern benchmark for 3D pose estimation, where we show full 3D mesh recovery that is on par with state-of-the-art methods trained on real 3D sequences, despite training only on synthetic humans from the SURREAL dataset.

[1]  Thomas Brox,et al.  Learning to Estimate 3D Hand Pose from Single RGB Images , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[2]  Qiao Wang,et al.  VirtualWorlds as Proxy for Multi-object Tracking Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Cordelia Schmid,et al.  Learning from Synthetic Humans , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  James J. Little,et al.  Exploiting temporal information for 3D pose estimation , 2017, ArXiv.

[5]  Pascal Fua,et al.  SLIC Superpixels Compared to State-of-the-Art Superpixel Methods , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Yan Wang,et al.  A Simple, Fast and Highly-Accurate Algorithm to Recover 3D Shape from 2D Landmarks on a Single Image , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Francesc Moreno-Noguer,et al.  3D Human Pose Estimation from a Single Image via Distance Matrix Regression , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[9]  William T. Freeman,et al.  Learning the Depths of Moving People by Watching Frozen People , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Dumitru Erhan,et al.  Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Andrew W. Fitzgibbon,et al.  Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences , 2016, ACM Trans. Graph..

[12]  Bernt Schiele,et al.  Articulated people detection and pose estimation: Reshaping the future , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Peter V. Gehler,et al.  Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image , 2016, ECCV.

[14]  Karteek Alahari,et al.  Learning Video Object Segmentation with Visual Memory , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[15]  Kyungdon Joo,et al.  Human body part classification from optical flow , 2016, 2016 13th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI).

[16]  Rama Chellappa,et al.  Visual Domain Adaptation: A survey of recent advances , 2015, IEEE Signal Processing Magazine.

[17]  Cordelia Schmid,et al.  LCR-Net: Localization-Classification-Regression for Human Pose , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Ajmal Mian,et al.  3D Action Recognition from Novel Viewpoints , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Zhenhua Wang,et al.  Synthesizing Training Images for Boosting Human 3D Pose Estimation , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[20]  Xiaowei Zhou,et al.  Learning to Estimate 3D Human Pose and Shape from a Single Color Image , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Gabriela Csurka,et al.  Domain Adaptation for Visual Applications: A Comprehensive Survey , 2017, ArXiv.

[22]  James J. Little,et al.  A Simple Yet Effective Baseline for 3d Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[23]  Michael J. Black,et al.  Estimating human shape and pose from a single image , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[24]  Rishabh Dabral,et al.  Structure-Aware and Temporally Coherent 3D Human Pose Estimation , 2017, ArXiv.

[25]  Wojciech Zaremba,et al.  Domain randomization for transferring deep neural networks from simulation to the real world , 2017, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[26]  Andrew Zisserman,et al.  Exploiting Temporal Context for 3D Human Pose Estimation in the Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Jonathan Tompson,et al.  Towards Accurate Multi-person Pose Estimation in the Wild , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Michael J. Black,et al.  Detailed Human Shape and Pose from Images , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Ankush Gupta,et al.  Synthetic Data for Text Localisation in Natural Images , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Michael J. Black,et al.  Combined discriminative and generative articulated pose and non-rigid shape estimation , 2007, NIPS.

[31]  Sebastian Thrun,et al.  SCAPE: shape completion and animation of people , 2005, SIGGRAPH '05.

[32]  Cordelia Schmid,et al.  MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild , 2016, NIPS.

[33]  Sergey Levine,et al.  (CAD)$^2$RL: Real Single-Image Flight without a Single Real Image , 2016, Robotics: Science and Systems.

[34]  Antti Oulasvirta,et al.  Fast and robust hand tracking using detection-guided optimization , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Ersin Yumer,et al.  Self-supervised Learning of Motion Capture , 2017, NIPS.

[36]  Jitendra Malik,et al.  Learning 3D Human Dynamics From Video , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Ajmal S. Mian,et al.  Learning a non-linear knowledge transfer model for cross-view action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Michael J. Black,et al.  Pose-conditioned joint angle limits for 3D human pose reconstruction , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Tomas Pfister,et al.  Learning from Simulated and Unsupervised Images through Adversarial Training , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  François Laviolette,et al.  Domain-Adversarial Training of Neural Networks , 2015, J. Mach. Learn. Res..

[41]  T. Kanade,et al.  Reconstructing 3D Human Pose from 2D Image Landmarks , 2012, ECCV.

[42]  Jitendra Malik,et al.  SFV , 2018, ACM Trans. Graph..

[43]  Matthias Bethge,et al.  ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness , 2018, ICLR.

[44]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[45]  Michael J. Black,et al.  Learning Human Optical Flow , 2018, BMVC.

[46]  Sean A. Munson,et al.  Unequal Representation and Gender Stereotypes in Image Search Results for Occupations , 2015, CHI.

[47]  Bernard Ghanem,et al.  Driving Policy Transfer via Modularity and Abstraction , 2018, CoRL.

[48]  Karteek Alahari,et al.  Learning to Segment Moving Objects , 2017, International Journal of Computer Vision.

[49]  Weichao Qiu Generating Human Images and Ground Truth using Computer Graphics , 2016 .

[50]  Xiaowei Zhou,et al.  Ordinal Depth Supervision for 3D Human Pose Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[51]  Thomas Brox,et al.  FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Bodo Rosenhahn,et al.  Supplementary Material to: Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera , 2018 .

[53]  Yichen Wei,et al.  Compositional Human Pose Regression , 2018, Comput. Vis. Image Underst..

[54]  G. Johansson Visual perception of biological motion and a model for its analysis , 1973 .

[55]  Bodo Rosenhahn,et al.  Optical Flow-Based 3D Human Motion Estimation from Monocular Video , 2017, GCPR.

[56]  Vladlen Koltun,et al.  Does computer vision matter for action? , 2019, Science Robotics.

[57]  Christian Theobalt,et al.  GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[58]  Hans-Peter Seidel,et al.  Multilinear pose and body shape estimation of dressed subjects from image sets , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[59]  Sergey Levine,et al.  Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping , 2018, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[60]  Peter V. Gehler,et al.  Unite the People: Closing the Loop Between 3D and 2D Human Representations , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[62]  Mohan S. Kankanhalli,et al.  Marker-Less 3D Human Motion Capture with Monocular Image Sequence and Height-Maps , 2016, ECCV.

[63]  Cristian Sminchisescu,et al.  Learning Joint Top-Down and Bottom-up Processes for 3D Visual Inference , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[64]  Peter V. Gehler,et al.  Neural Body Fitting: Unifying Deep Learning and Model Based Human Pose and Shape Estimation , 2018, 2018 International Conference on 3D Vision (3DV).

[65]  Xiaowei Zhou,et al.  Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[66]  Ankush Gupta,et al.  Unsupervised Learning of Object Landmarks through Conditional Image Generation , 2018, NeurIPS.

[67]  Yichen Wei,et al.  Towards 3D Human Pose Estimation in the Wild: A Weakly-Supervised Approach , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[68]  Xiaowei Zhou,et al.  Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[69]  Simon Lucey,et al.  Deterministic 3D Human Pose Estimation Using Rigid Structure , 2010, ECCV.

[70]  Kai Oliver Arras,et al.  How Robust is 3D Human Pose Estimation to Occlusion? , 2018, ArXiv.

[71]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[72]  Michael J. Black,et al.  FlowCap: 2D Human Pose from Optical Flow , 2015, GCPR.

[73]  拓海 杉山,et al.  “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”の学習報告 , 2017 .

[74]  Thomas Brox,et al.  A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[75]  Wen Gao,et al.  Robust Estimation of 3D Human Poses from a Single Image , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[76]  Yuandong Tian,et al.  Single Image 3D Interpreter Network , 2016, ECCV.

[77]  Bernt Schiele,et al.  Learning people detection models from few training samples , 2011, CVPR 2011.

[78]  Pascal Fua,et al.  Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision , 2016, 2017 International Conference on 3D Vision (3DV).

[79]  Alexei A. Efros,et al.  Learning Dense Correspondence via 3D-Guided Cycle Consistency , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[80]  Jieyu Zhao,et al.  Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints , 2017, EMNLP.

[81]  Andrew W. Fitzgibbon,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR 2011.

[82]  Christian Theobalt,et al.  Real-Time Hand Tracking Under Occlusion from an Egocentric RGB-D Sensor , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[83]  Stefano Soatto,et al.  Relevant Feature Selection for Human Pose Estimation and Localization in Cluttered Images , 2008, ECCV.

[84]  Vincent Lepetit,et al.  On Pre-Trained Image Features and Synthetic Images for Deep Learning , 2017, ECCV Workshops.

[85]  J. Cutting,et al.  Recognizing friends by their walk: Gait perception without familiarity cues , 1977 .

[86]  Sudeep Sarkar,et al.  Learning Camera Viewpoint Using CNN to Improve 3D Body Pose Estimation , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[87]  Camillo J. Taylor,et al.  Reconstruction of articulated objects from point correspondences in a single uncalibrated image , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[88]  Sergey Levine,et al.  Sim-To-Real via Sim-To-Sim: Data-Efficient Robotic Grasping via Randomized-To-Canonical Adaptation Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[89]  Yichen Wei,et al.  Integral Human Pose Regression , 2017, ECCV.

[90]  Andrew Owens,et al.  MoSculp: Interactive Visualization of Shape and Time , 2018, UIST.

[91]  Cristian Sminchisescu,et al.  Monocular 3D Pose and Shape Estimation of Multiple People in Natural Scenes: The Importance of Multiple Scene Constraints , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[92]  Michael J. Black,et al.  Optical Flow Estimation Using a Spatial Pyramid Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[93]  Yuan Shi,et al.  Geodesic flow kernel for unsupervised domain adaptation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[94]  Michael I. Jordan,et al.  Machine learning: Trends, perspectives, and prospects , 2015, Science.

[95]  Stefan Hinterstoißer,et al.  An Annotation Saved is an Annotation Earned: Using Fully Synthetic Training for Object Instance Detection , 2019, ArXiv.

[96]  J. Cutting,et al.  Recognizing the sex of a walker from a dynamic point-light display , 1977 .

[97]  Yinghao Huang,et al.  Towards Accurate Marker-Less Human Shape and Pose Estimation over Time , 2017, 2017 International Conference on 3D Vision (3DV).