Paper Information - PoSynDA: Multi-Hypothesis Pose Synthesis Domain Adaptation for Robust 3D Human Pose Estimation

Current 3D human pose estimators struggle to adapt to new datasets because 2D-3D pose pairs are scarce in target-domain training sets. We present the Multi-Hypothesis Pose Synthesis Domain Adaptation (PoSynDA) framework to overcome this issue without extensive target-domain annotation. Using a diffusion-centric structure, PoSynDA simulates the 3D pose distribution in the target domain, filling the data diversity gap. By incorporating a multi-hypothesis network, it creates diverse pose hypotheses and aligns them with the target domain. Target-specific source augmentation obtains data following the target-domain distribution from the source domain by decoupling the scale and position parameters. A teacher-student paradigm and low-rank adaptation further refine the process. PoSynDA demonstrates competitive performance on benchmarks such as Human3.6M, MPI-INF-3DHP, and 3DPW, and is even comparable with the target-trained MixSTE model [18]. This work paves the way for the practical application of 3D human pose estimation. The code is available at https://github.com/hbing-l/PoSynDA.
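The abstract names low-rank adaptation [23] as one refinement step but does not say how it is applied inside PoSynDA. As a rough illustration of the general LoRA idea only, the sketch below keeps a pretrained weight matrix frozen and learns a low-rank update, so the effective weight is W + (alpha / r) * B A. The class name `LoRALinear`, the helper `matmul`, and the parameters `rank`, `alpha`, and `seed` are hypothetical choices for this example and are not taken from the PoSynDA code.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Illustrative linear layer: frozen W plus a low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, shape (out_dim, in_dim)
        self.scale = alpha / rank
        # A starts with small random values and B starts at zero, so the
        # adapted layer initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + scale * (B @ A) without modifying the frozen weight."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the adapted layer to one input vector x of length in_dim."""
        column = [[v] for v in x]
        return [row[0] for row in matmul(self.effective_weight(), column)]

    def num_trainable(self):
        """Count entries of A and B, the only parameters that would be trained."""
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)
```

With rank r, the trainable parameter count is r * (in_dim + out_dim) instead of in_dim * out_dim, which is why this kind of update is attractive when only a small amount of target-domain data is available for fine-tuning.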

[1] W. Liu, et al. KeyPosS: Plug-and-Play Facial Landmark Detection through GPS-Inspired True-Range Multilateration, 2023, arXiv.

[2] Xuansong Xie, et al. Overcoming Topology Agnosticism: Enhancing Skeleton-Based Action Recognition through Redefined Skeletal Topology Awareness, 2023, arXiv.

[3] Yu-Gang Jiang, et al. Implicit Temporal Modeling with Learnable Alignment for Video Recognition, 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV).

[4] Binghui Chen, et al. DAMO-StreamNet: Optimizing Streaming Perception in Autonomous Driving, 2023, IJCAI.

[5] Jenq-Neng Hwang, et al. Global Adaptation meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation, 2023, arXiv.

[6] K. Han, et al. Diffusion-Based 3D Human Pose Estimation with Multi-Hypothesis Aggregation, 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV).

[7] Shalini De Mello, et al. Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8] W. Liu, et al. HDFormer: High-order Directed Transformer for 3D Human Pose Estimation, 2023, IJCAI.

[9] C. Li, et al. Hypergraph Transformer for Skeleton-based Action Recognition, 2022, arXiv.

[10] Pengyu Li, et al. Longshortnet: Exploring Temporal and Semantic Features Fusion In Streaming Perception, 2022, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11] Xuansong Xie, et al. Procontext: Exploring Progressive Context Transformer for Tracking, 2022, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12] Ben Poole, et al. DreamFusion: Text-to-3D using 2D Diffusion, 2022, ICLR.

[13] Amit H. Bermano, et al. Human Motion Diffusion Model, 2022, ICLR.

[14] A. Hauptmann, et al. GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement, 2022, ACM Multimedia.

[15] C. Li, et al. Generative Action Description Prompts for Skeleton-based Action Recognition, 2022, arXiv:2208.05318.

[16] C. Li, et al. Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition, 2022, ECCV.

[17] A. Hauptmann, et al. Rethinking Spatial Invariance of Convolutional Networks for Object Counting, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18] Junsong Yuan, et al. MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19] Z. J. Wang, et al. AdaptPose: Cross-Dataset Adaptation for 3D Human Pose Estimation by Learnable Motion Generation, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20] L. Gool, et al. MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Ling Shao, et al. Deep 3D human pose estimation: A review, 2021, Comput. Vis. Image Underst.

[22] Yoav Goldberg, et al. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models, 2021, ACL.

[23] Yelong Shen, et al. LoRA: Low-Rank Adaptation of Large Language Models, 2021, ICLR.

[24] Jiashi Feng, et al. PoseAug: A Differentiable Pose Augmentation Framework for 3D Human Pose Estimation, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Lijuan Wang, et al. Mesh Graphormer, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[26] Bingbing Ni, et al. Bilevel Online Adaptation for Out-of-Domain Human Mesh Reconstruction, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27] Zhengming Ding, et al. 3D Human Pose Estimation with Spatial and Temporal Transformers, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[28] Enhua Wu, et al. Transformer in Transformer, 2021, NeurIPS.

[29] Pichao Wang, et al. TransReID: Transformer-based Object Re-Identification, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[30] Xiao Wu, et al. DB-LSTM: Densely-connected Bi-directional LSTM for human action recognition, 2020, Neurocomputing.

[31] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.

[32] Song Han, et al. TinyTL: Reduce Activations, Not Trainable Parameters for Efficient On-Device Learning, 2020, arXiv:2007.11622.

[33] Stephen Lin, et al. SRNet: Improving Generalization in 3D Human Pose Estimation with a Split-and-Recombine Approach, 2020, ECCV.

[34] Jiashi Feng, et al. Inference Stage Optimization for Cross-scenario 3D Human Pose Estimation, 2020, NeurIPS.

[35] Haoyi Xiong, et al. Generating Person Images with Appearance-aware Pose Stylizer, 2020, IJCAI.

[36] Pieter Abbeel, et al. Denoising Diffusion Probabilistic Models, 2020, NeurIPS.

[37] Kwang-Ting Cheng, et al. Cascaded Deep Monocular 3D Human Pose Estimation With Evolutionary Training Data, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38] Ruixu Liu, et al. Attention Mechanism Exploits Temporal Contexts: Real-Time 3D Human Pose Reconstruction, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39] Dahua Lin, et al. Motion Guided 3D Pose Estimation from Videos, 2020, ECCV.

[40] Andrea Vedaldi, et al. Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D Human Pose Estimation, 2020, 2021 International Conference on 3D Vision (3DV).

[41] Michael J. Black, et al. VIBE: Video Inference for Human Body Pose and Shape Estimation, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42] Nadia Magnenat-Thalmann, et al. Exploiting Spatial-Temporal Relationships for 3D Pose Estimation via Graph Convolutional Networks, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[43] Alexander Hauptmann, et al. Improving the Learning of Multi-column Convolutional Neural Network for Crowd Counting, 2019, ACM Multimedia.

[44] Alexander Hauptmann, et al. Learning Spatial Awareness to Improve Crowd Counting, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[45] Yizhou Wang, et al. Optimizing Network Structure for 3D Human Pose Estimation, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[46] Yu Tian, et al. Semantic Graph Convolutional Networks for 3D Human Pose Regression, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47] Bodo Rosenhahn, et al. RepNet: Weakly Supervised Training of an Adversarial Reprojection Network for 3D Human Pose Estimation, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48] Dario Pavllo, et al. 3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[49] Alan L. Yuille, et al. OriNet: A Fully Convolutional Network for 3D Human Pose Estimation, 2018, BMVC.

[50] Bodo Rosenhahn, et al. Supplementary Material to: Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera, 2018.

[51] Qiang Peng, et al. Personalized clothing recommendation combining user social circle and fashion style consistency, 2018, Multimedia Tools and Applications.

[52] Jitendra Malik, et al. End-to-End Recovery of Human Shape and Pose, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[53] Christian Theobalt, et al. Single-Shot Multi-person 3D Pose Estimation from Monocular RGB, 2017, 2018 International Conference on 3D Vision (3DV).

[54] Yang Liu, et al. Video2Shop: Exact Matching Clothes in Videos to Online Shopping Images, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55] Yang Liu, et al. Video eCommerce++: Toward Large Scale Online Video Advertising, 2017, IEEE Transactions on Multimedia.

[56] James J. Little, et al. A Simple Yet Effective Baseline for 3d Human Pose Estimation, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[57] Hans-Peter Seidel, et al. VNect, 2017, ACM Trans. Graph.

[58] Bo Zhao, et al. Multi-View Image Generation from a Single-View, 2017, ACM Multimedia.

[59] Yichen Wei, et al. Compositional Human Pose Regression, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[60] Ehsan Jahangiri, et al. Generating Multiple Diverse Hypotheses for Human 3D Pose Consistent with 2D Joint Detections, 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[61] Pascal Fua, et al. Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision, 2016, 2017 International Conference on 3D Vision (3DV).

[62] Yang Liu, et al. Video eCommerce: Towards Online Video Advertising, 2016, ACM Multimedia.

[63] Wei Zhang, et al. Deep Kinematic Pose Regression, 2016, ECCV Workshops.

[64] Cordelia Schmid, et al. MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild, 2016, NIPS.

[65] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[66] Antoni B. Chan, et al. 3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network, 2014, ACCV.

[67] Cristian Sminchisescu, et al. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments, 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[68] Aaron C. Courville, et al. Generative Adversarial Networks, 2014, arXiv:1406.2661.

[69] W. K. Hastings, et al. Monte Carlo Sampling Methods Using Markov Chains and Their Applications, 1970.

[70] Qing Li, et al. VIREO @ TRECVID 2017: Video-to-Text, Ad-hoc Video Search, and Video hyperlinking, 2017, TRECVID.