Test-Time Personalization with a Transformer for Human Pose Estimation

We propose to personalize a 2D human pose estimator given a set of test images of a person without using any manual annotations. While there is a significant advancement in human pose estimation, it is still very challenging for a model to generalize to different unknown environments and unseen persons. Instead of using a fixed model for every test case, we adapt our pose estimator during test time to exploit person-specific information. We first train our model on diverse data with both a supervised and a self-supervised pose estimation objectives jointly. We use a Transformer model to build a transformation between the self-supervised keypoints and the supervised keypoints. During test time, we personalize and adapt our model by fine-tuning with the self-supervised objective. The pose is then improved by transforming the updated self-supervised keypoints. We experiment with multiple datasets and show significant improvements on pose estimations with our self-supervised personalization. Project page with code is available at https://liyz15.github.io/TTP/.

[1]  Alexander Mathis,et al.  End-to-End Trainable Multi-Instance Pose Estimation with Transformers , 2021, ArXiv.

[2]  Xiaogang Wang,et al.  Structured Feature Learning for Pose Estimation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Andrew Zisserman,et al.  Automatic and Efficient Human Pose Estimation for Sign Language Videos , 2013, International Journal of Computer Vision.

[4]  Björn Ommer,et al.  Unsupervised Part-Based Disentangling of Object Shape and Appearance , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Ankush Gupta,et al.  Unsupervised Learning of Object Keypoints for Perception and Control , 2019, NeurIPS.

[7]  Wankou Yang,et al.  TransPose: Towards Explainable Human Pose Estimation by Transformer , 2020, ArXiv.

[8]  Yichen Wei,et al.  Simple Baselines for Human Pose Estimation and Tracking , 2018, ECCV.

[9]  Mark Chen,et al.  Generative Pretraining From Pixels , 2020, ICML.

[10]  Jitendra Malik,et al.  Human Pose Estimation with Iterative Error Feedback , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[12]  Shuicheng Yan,et al.  Single-Stage Multi-Person Pose Machines , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[13]  Ankush Gupta,et al.  Unsupervised Learning of Object Landmarks through Conditional Image Generation , 2018, NeurIPS.

[14]  Ankush Gupta,et al.  Self-Supervised Learning of Interpretable Keypoints From Unlabelled Videos , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Dustin Tran,et al.  Image Transformer , 2018, ICML.

[16]  Ashish Vaswani,et al.  Stand-Alone Self-Attention in Vision Models , 2019, NeurIPS.

[17]  Georg Heigold,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2021, ICLR.

[18]  Juergen Gall,et al.  Pose for Action - Action for Pose , 2016, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[19]  Dieter Fox,et al.  Causal Discovery in Physical Systems from Videos , 2020, NeurIPS.

[20]  Luc Van Gool,et al.  Thin-Slicing Network: A Deep Structured Model for Pose Estimation in Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Nuno Vasconcelos,et al.  A-SDF: Learning Disentangled Signed Distance Functions for Articulated Shape Representation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Li Fei-Fei,et al.  Perceptual Losses for Real-Time Style Transfer and Super-Resolution , 2016, ECCV.

[23]  Alexei A. Efros,et al.  Self-Supervised Policy Adaptation during Deployment , 2020, ICLR.

[24]  Georg Heigold,et al.  Object-Centric Learning with Slot Attention , 2020, NeurIPS.

[25]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[26]  Yong Liu,et al.  Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose , 2020, AAAI.

[27]  Dong Liu,et al.  Deep High-Resolution Representation Learning for Human Pose Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Lorenzo Torresani,et al.  Detect-and-Track: Efficient Pose Estimation in Videos , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  Yichen Wei,et al.  Towards 3D Human Pose Estimation in the Wild: A Weakly-Supervised Approach , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[30]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[32]  Davide Modolo,et al.  Combining Detection and Tracking for Human Pose Estimation in Videos , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Antoni B. Chan,et al.  3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network , 2014, ACCV.

[34]  Andrew Zisserman,et al.  Personalizing Human Video Pose Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Stephen Lin,et al.  GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[36]  Bolei Zhou,et al.  Semantic photo manipulation with a generative image prior , 2019, ACM Trans. Graph..

[37]  Zhuowen Tu,et al.  Pose Recognition with Cascade Transformers , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Weiyu Zhang,et al.  From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding , 2013, 2013 IEEE International Conference on Computer Vision.

[39]  Cordelia Schmid,et al.  VideoBERT: A Joint Model for Video and Language Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[40]  Xiaogang Wang,et al.  Learning Feature Pyramids for Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[41]  Gang Yu,et al.  Cascaded Pyramid Network for Multi-person Pose Estimation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[42]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[43]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[44]  Andrew Zisserman,et al.  Flowing ConvNets for Human Pose Estimation in Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[45]  Xueting Li,et al.  Online Adaptation for Consistent Mesh Reconstruction in the Wild , 2020, NeurIPS.

[46]  Yuting Zhang,et al.  Unsupervised Discovery of Object Landmarks as Structural Representations , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[47]  Björn Ommer,et al.  Self-Supervised Learning of Pose Embeddings from Spatiotemporal Relations in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[48]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Nikos Komodakis,et al.  Unsupervised Representation Learning by Predicting Image Rotations , 2018, ICLR.

[50]  Xiaolong Wang,et al.  Hand-Object Contact Consistency Reasoning for Human Grasps Generation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[51]  HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Michal Irani,et al.  "Zero-Shot" Super-Resolution Using Deep Internal Learning , 2017, CVPR.

[53]  Chunhua Shen,et al.  TFPose: Direct Human Pose Estimation with Transformers , 2021, ArXiv.

[54]  Dario Pavllo,et al.  3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Christian Szegedy,et al.  DeepPose: Human Pose Estimation via Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[56]  Andrew Zisserman,et al.  Domain Adaptation for Upper Body Pose Tracking in Signed TV Broadcasts , 2013, BMVC.

[57]  Alexei A. Efros,et al.  Test-Time Training with Self-Supervision for Generalization under Distribution Shifts , 2019, ICML.

[58]  Yichen Wei,et al.  Relation Networks for Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[59]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[60]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[61]  Trevor Darrell,et al.  Fully Test-time Adaptation by Entropy Minimization , 2020, ArXiv.

[62]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[63]  Andreas Savakis,et al.  UniPose: Unified Human Pose Estimation in Single Images and Videos , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  Andrea Vedaldi,et al.  Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D Human Pose Estimation , 2020, 2021 International Conference on 3D Vision (3DV).

[65]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[66]  Russ Tedrake,et al.  Keypoints into the Future: Self-Supervised Correspondence in Model-Based Reinforcement Learning , 2020, CoRL.

[67]  Alexei A. Efros,et al.  Colorful Image Colorization , 2016, ECCV.

[68]  Shai Bagon,et al.  InGAN: Capturing and Remapping the "DNA" of a Natural Image , 2018 .

[69]  Vladlen Koltun,et al.  Exploring Self-Attention for Image Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[70]  Alan L. Yuille,et al.  Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations , 2014, NIPS.

[71]  Nicolas Usunier,et al.  End-to-End Object Detection with Transformers , 2020, ECCV.

[72]  Jianbo Liu,et al.  LSTM Pose Machines , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[73]  Jonathan Tompson,et al.  Towards Accurate Multi-person Pose Estimation in the Wild , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).