论文信息 - Test-Time Personalization with a Transformer for Human Pose Estimation

Test-Time Personalization with a Transformer for Human Pose Estimation

We propose to personalize a 2D human pose estimator given a set of test images of a person without using any manual annotations. While there is a significant advancement in human pose estimation, it is still very challenging for a model to generalize to different unknown environments and unseen persons. Instead of using a fixed model for every test case, we adapt our pose estimator during test time to exploit person-specific information. We first train our model on diverse data with both a supervised and a self-supervised pose estimation objectives jointly. We use a Transformer model to build a transformation between the self-supervised keypoints and the supervised keypoints. During test time, we personalize and adapt our model by fine-tuning with the self-supervised objective. The pose is then improved by transforming the updated self-supervised keypoints. We experiment with multiple datasets and show significant improvements on pose estimations with our self-supervised personalization. Project page with code is available at https://liyz15.github.io/TTP/.

[1] Alexander Mathis,et al. End-to-End Trainable Multi-Instance Pose Estimation with Transformers , 2021, ArXiv.

[2] Xiaogang Wang,et al. Structured Feature Learning for Pose Estimation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Andrew Zisserman,et al. Automatic and Efficient Human Pose Estimation for Sign Language Videos , 2013, International Journal of Computer Vision.

[4] Björn Ommer,et al. Unsupervised Part-Based Disentangling of Object Shape and Appearance , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Ankush Gupta,et al. Unsupervised Learning of Object Keypoints for Perception and Control , 2019, NeurIPS.

[7] Wankou Yang,et al. TransPose: Towards Explainable Human Pose Estimation by Transformer , 2020, ArXiv.

[8] Yichen Wei,et al. Simple Baselines for Human Pose Estimation and Tracking , 2018, ECCV.

[9] Mark Chen,et al. Generative Pretraining From Pixels , 2020, ICML.

[10] Jitendra Malik,et al. Human Pose Estimation with Iterative Error Feedback , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Nitish Srivastava,et al. Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[12] Shuicheng Yan,et al. Single-Stage Multi-Person Pose Machines , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[13] Ankush Gupta,et al. Unsupervised Learning of Object Landmarks through Conditional Image Generation , 2018, NeurIPS.

[14] Ankush Gupta,et al. Self-Supervised Learning of Interpretable Keypoints From Unlabelled Videos , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Dustin Tran,et al. Image Transformer , 2018, ICML.

[16] Ashish Vaswani,et al. Stand-Alone Self-Attention in Vision Models , 2019, NeurIPS.

[17] Georg Heigold,et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2021, ICLR.

[18] Juergen Gall,et al. Pose for Action - Action for Pose , 2016, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[19] Dieter Fox,et al. Causal Discovery in Physical Systems from Videos , 2020, NeurIPS.

[20] Luc Van Gool,et al. Thin-Slicing Network: A Deep Structured Model for Pose Estimation in Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Nuno Vasconcelos,et al. A-SDF: Learning Disentangled Signed Distance Functions for Articulated Shape Representation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[22] Li Fei-Fei,et al. Perceptual Losses for Real-Time Style Transfer and Super-Resolution , 2016, ECCV.

[23] Alexei A. Efros,et al. Self-Supervised Policy Adaptation during Deployment , 2020, ICLR.

[24] Georg Heigold,et al. Object-Centric Learning with Slot Attention , 2020, NeurIPS.

[25] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[26] Yong Liu,et al. Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose , 2020, AAAI.

[27] Dong Liu,et al. Deep High-Resolution Representation Learning for Human Pose Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28] Lorenzo Torresani,et al. Detect-and-Track: Efficient Pose Estimation in Videos , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29] Yichen Wei,et al. Towards 3D Human Pose Estimation in the Wild: A Weakly-Supervised Approach , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[30] Abhinav Gupta,et al. Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31] Jia Deng,et al. Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[32] Davide Modolo,et al. Combining Detection and Tracking for Human Pose Estimation in Videos , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33] Antoni B. Chan,et al. 3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network , 2014, ACCV.

[34] Andrew Zisserman,et al. Personalizing Human Video Pose Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35] Stephen Lin,et al. GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[36] Bolei Zhou,et al. Semantic photo manipulation with a generative image prior , 2019, ACM Trans. Graph..

[37] Zhuowen Tu,et al. Pose Recognition with Cascade Transformers , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38] Weiyu Zhang,et al. From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding , 2013, 2013 IEEE International Conference on Computer Vision.

[39] Cordelia Schmid,et al. VideoBERT: A Joint Model for Video and Language Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[40] Xiaogang Wang,et al. Learning Feature Pyramids for Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[41] Gang Yu,et al. Cascaded Pyramid Network for Multi-person Pose Estimation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[42] Bernt Schiele,et al. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[43] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[44] Andrew Zisserman,et al. Flowing ConvNets for Human Pose Estimation in Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[45] Xueting Li,et al. Online Adaptation for Consistent Mesh Reconstruction in the Wild , 2020, NeurIPS.

[46] Yuting Zhang,et al. Unsupervised Discovery of Object Landmarks as Structural Representations , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[47] Björn Ommer,et al. Self-Supervised Learning of Pose Embeddings from Spatiotemporal Relations in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[48] Varun Ramakrishna,et al. Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49] Nikos Komodakis,et al. Unsupervised Representation Learning by Predicting Image Rotations , 2018, ICLR.

[50] Xiaolong Wang,et al. Hand-Object Contact Consistency Reasoning for Human Grasps Generation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[51] HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[52] Michal Irani,et al. "Zero-Shot" Super-Resolution Using Deep Internal Learning , 2017, CVPR.

[53] Chunhua Shen,et al. TFPose: Direct Human Pose Estimation with Transformers , 2021, ArXiv.

[54] Dario Pavllo,et al. 3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55] Christian Szegedy,et al. DeepPose: Human Pose Estimation via Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[56] Andrew Zisserman,et al. Domain Adaptation for Upper Body Pose Tracking in Signed TV Broadcasts , 2013, BMVC.

[57] Alexei A. Efros,et al. Test-Time Training with Self-Supervision for Generalization under Distribution Shifts , 2019, ICML.

[58] Yichen Wei,et al. Relation Networks for Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[59] Ross B. Girshick,et al. Mask R-CNN , 2017, 1703.06870.

[60] Cristian Sminchisescu,et al. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[61] Trevor Darrell,et al. Fully Test-time Adaptation by Entropy Minimization , 2020, ArXiv.

[62] Yaser Sheikh,et al. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[63] Andreas Savakis,et al. UniPose: Unified Human Pose Estimation in Single Images and Videos , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[64] Andrea Vedaldi,et al. Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D Human Pose Estimation , 2020, 2021 International Conference on 3D Vision (3DV).

[65] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[66] Russ Tedrake,et al. Keypoints into the Future: Self-Supervised Correspondence in Model-Based Reinforcement Learning , 2020, CoRL.

[67] Alexei A. Efros,et al. Colorful Image Colorization , 2016, ECCV.

[68] Shai Bagon,et al. InGAN: Capturing and Remapping the "DNA" of a Natural Image , 2018 .

[69] Vladlen Koltun,et al. Exploring Self-Attention for Image Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[70] Alan L. Yuille,et al. Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations , 2014, NIPS.

[71] Nicolas Usunier,et al. End-to-End Object Detection with Transformers , 2020, ECCV.

[72] Jianbo Liu,et al. LSTM Pose Machines , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[73] Jonathan Tompson,et al. Towards Accurate Multi-person Pose Estimation in the Wild , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).