PoseBERT: A Generic Transformer Module for Temporal 3D Human Modeling

Training state-of-the-art models for human pose estimation in videos requires datasets with annotations that are difficult and expensive to obtain. Although transformers have recently been used for body pose sequence modeling, related methods rely on pseudo-ground truth to augment the limited training data currently available for learning such models. In this paper, we introduce PoseBERT, a transformer module that is fully trained on 3D Motion Capture (MoCap) data via masked modeling. It is simple, generic and versatile, as it can be plugged on top of any image-based model to transform it into a video-based model that leverages temporal information. We showcase variants of PoseBERT with different inputs, ranging from 3D skeleton keypoints to rotations of a 3D parametric model for either the full body (SMPL) or just the hands (MANO). Since PoseBERT training is task agnostic, the model can be applied to several tasks such as pose refinement, future pose prediction or motion completion without finetuning. Our experimental results validate that adding PoseBERT on top of various state-of-the-art pose estimation methods consistently improves their performance, while its low computational cost allows us to use it in a real-time demo for smoothly animating a robotic hand via a webcam. Test code and models are available at https://github.com/naver/posebert.
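
The masked-modeling idea mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mask_pose_sequence` and `masked_reconstruction_error`, the masking ratio of 0.25, the zero vector used as a mask token, and the squared-error objective are all assumptions chosen for illustration. The sketch hides a random subset of frames in a MoCap pose sequence and scores a prediction only on the hidden frames.

```python
import random
from typing import List, Tuple

Pose = List[float]  # one frame: a flat vector of pose parameters (hypothetical layout)


def mask_pose_sequence(
    poses: List[Pose],
    mask_ratio: float = 0.25,  # assumed value, not taken from the paper
    seed: int = 0,
) -> Tuple[List[Pose], List[bool]]:
    """Replace a random subset of frames with a mask token (here: zeros)."""
    rng = random.Random(seed)
    num_frames = len(poses)
    if num_frames == 0:
        return [], []
    # Hide at least one frame so there is always something to reconstruct.
    num_masked = max(1, round(mask_ratio * num_frames))
    masked_idx = set(rng.sample(range(num_frames), num_masked))
    mask_token = [0.0] * len(poses[0])  # assumption: a learned token in practice
    corrupted = [
        list(mask_token) if t in masked_idx else list(pose)
        for t, pose in enumerate(poses)
    ]
    is_masked = [t in masked_idx for t in range(num_frames)]
    return corrupted, is_masked


def masked_reconstruction_error(
    predicted: List[Pose], target: List[Pose], is_masked: List[bool]
) -> float:
    """Mean squared error computed on the masked frames only."""
    errors = [
        sum((a - b) ** 2 for a, b in zip(p, t)) / len(t)
        for p, t, m in zip(predicted, target, is_masked)
        if m
    ]
    return sum(errors) / len(errors) if errors else 0.0


# Toy sequence of 8 frames with 2 parameters each.
poses = [[float(t), float(t) + 0.5] for t in range(8)]
corrupted, is_masked = mask_pose_sequence(poses, mask_ratio=0.25, seed=0)
# A temporal model would take `corrupted` as input and be trained to
# minimise masked_reconstruction_error(model_output, poses, is_masked).
```

Under the same assumptions, frames that are missing at test time (for example future frames, or a gap in the middle of a sequence) could be marked as masked in the same way, which is one plausible reading of why a model trained like this needs no task-specific finetuning.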
