PoSynDA: Multi-Hypothesis Pose Synthesis Domain Adaptation for Robust 3D Human Pose Estimation

The current 3D human pose estimators face challenges in adapting to new datasets due to the scarcity of 2D-3D pose pairs in target domain training sets. We present the \textit{Multi-Hypothesis \textbf{P}ose \textbf{Syn}thesis \textbf{D}omain \textbf{A}daptation} (\textbf{PoSynDA}) framework to overcome this issue without extensive target domain annotation. Utilizing a diffusion-centric structure, PoSynDA simulates the 3D pose distribution in the target domain, filling the data diversity gap. By incorporating a multi-hypothesis network, it creates diverse pose hypotheses and aligns them with the target domain. Target-specific source augmentation obtains the target domain distribution data from the source domain by decoupling the scale and position parameters. The teacher-student paradigm and low-rank adaptation further refine the process. PoSynDA demonstrates competitive performance on benchmarks, such as Human3.6M, MPI-INF-3DHP, and 3DPW, even comparable with the target-trained MixSTE model~\cite{zhang2022mixste}. This work paves the way for the practical application of 3D human pose estimation. The code is available at https://github.com/hbing-l/PoSynDA.

[1]  W. Liu,et al.  KeyPosS: Plug-and-Play Facial Landmark Detection through GPS-Inspired True-Range Multilateration , 2023, ArXiv.

[2]  Xuansong Xie,et al.  Overcoming Topology Agnosticism: Enhancing Skeleton-Based Action Recognition through Redefined Skeletal Topology Awareness , 2023, ArXiv.

[3]  Yu-Gang Jiang,et al.  Implicit Temporal Modeling with Learnable Alignment for Video Recognition , 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV).

[4]  Binghui Chen,et al.  DAMO-StreamNet: Optimizing Streaming Perception in Autonomous Driving , 2023, IJCAI.

[5]  Jenq-Neng Hwang,et al.  Global Adaptation meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation , 2023, ArXiv.

[6]  K. Han,et al.  Diffusion-Based 3D Human Pose Estimation with Multi-Hypothesis Aggregation , 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV).

[7]  Shalini De Mello,et al.  Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models , 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  W. Liu,et al.  HDFormer: High-order Directed Transformer for 3D Human Pose Estimation , 2023, IJCAI.

[9]  C. Li,et al.  Hypergraph Transformer for Skeleton-based Action Recognition , 2022, ArXiv.

[10]  Pengyu Li,et al.  Longshortnet: Exploring Temporal and Semantic Features Fusion In Streaming Perception , 2022, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Xuansong Xie,et al.  Procontext: Exploring Progressive Context Transformer for Tracking , 2022, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Ben Poole,et al.  DreamFusion: Text-to-3D using 2D Diffusion , 2022, ICLR.

[13]  Amit H. Bermano,et al.  Human Motion Diffusion Model , 2022, ICLR.

[14]  A. Hauptmann,et al.  GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement , 2022, ACM Multimedia.

[15]  C. Li,et al.  Generative Action Description Prompts for Skeleton-based Action Recognition , 2022, 2208.05318.

[16]  C. Li,et al.  Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition , 2022, ECCV.

[17]  A. Hauptmann,et al.  Rethinking Spatial Invariance of Convolutional Networks for Object Counting , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Junsong Yuan,et al.  MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Z. J. Wang,et al.  AdaptPose: Cross-Dataset Adaptation for 3D Human Pose Estimation by Learnable Motion Generation , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  L. Gool,et al.  MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Ling Shao,et al.  Deep 3D human pose estimation: A review , 2021, Comput. Vis. Image Underst..

[22]  Yoav Goldberg,et al.  BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models , 2021, ACL.

[23]  Yelong Shen,et al.  LoRA: Low-Rank Adaptation of Large Language Models , 2021, ICLR.

[24]  Jiashi Feng,et al.  PoseAug: A Differentiable Pose Augmentation Framework for 3D Human Pose Estimation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Lijuan Wang,et al.  Mesh Graphormer , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[26]  Bingbing Ni,et al.  Bilevel Online Adaptation for Out-of-Domain Human Mesh Reconstruction , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Zhengming Ding,et al.  3D Human Pose Estimation with Spatial and Temporal Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[28]  Enhua Wu,et al.  Transformer in Transformer , 2021, NeurIPS.

[29]  Pichao Wang,et al.  TransReID: Transformer-based Object Re-Identification , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[30]  Xiao Wu,et al.  DB-LSTM: Densely-connected Bi-directional LSTM for human action recognition , 2020, Neurocomputing.

[31]  S. Gelly,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.

[32]  Song Han,et al.  TinyTL: Reduce Activations, Not Trainable Parameters for Efficient On-Device Learning , 2020, 2007.11622.

[33]  Stephen Lin,et al.  SRNet: Improving Generalization in 3D Human Pose Estimation with a Split-and-Recombine Approach , 2020, ECCV.

[34]  Jiashi Feng,et al.  Inference Stage Optimization for Cross-scenario 3D Human Pose Estimation , 2020, NeurIPS.

[35]  Haoyi Xiong,et al.  Generating Person Images with Appearance-aware Pose Stylizer , 2020, IJCAI.

[36]  Pieter Abbeel,et al.  Denoising Diffusion Probabilistic Models , 2020, NeurIPS.

[37]  Kwang-Ting Cheng,et al.  Cascaded Deep Monocular 3D Human Pose Estimation With Evolutionary Training Data , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Ruixu Liu,et al.  Attention Mechanism Exploits Temporal Contexts: Real-Time 3D Human Pose Reconstruction , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Dahua Lin,et al.  Motion Guided 3D Pose Estimation from Videos , 2020, ECCV.

[40]  Andrea Vedaldi,et al.  Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D Human Pose Estimation , 2020, 2021 International Conference on 3D Vision (3DV).

[41]  Michael J. Black,et al.  VIBE: Video Inference for Human Body Pose and Shape Estimation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Nadia Magnenat-Thalmann,et al.  Exploiting Spatial-Temporal Relationships for 3D Pose Estimation via Graph Convolutional Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[43]  Alexander Hauptmann,et al.  Improving the Learning of Multi-column Convolutional Neural Network for Crowd Counting , 2019, ACM Multimedia.

[44]  Alexander Hauptmann,et al.  Learning Spatial Awareness to Improve Crowd Counting , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[45]  Yizhou Wang,et al.  Optimizing Network Structure for 3D Human Pose Estimation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[46]  Yu Tian,et al.  Semantic Graph Convolutional Networks for 3D Human Pose Regression , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Bodo Rosenhahn,et al.  RepNet: Weakly Supervised Training of an Adversarial Reprojection Network for 3D Human Pose Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Dario Pavllo,et al.  3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Alan L. Yuille,et al.  OriNet: A Fully Convolutional Network for 3D Human Pose Estimation , 2018, BMVC.

[50]  Bodo Rosenhahn,et al.  Supplementary Material to: Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera , 2018 .

[51]  Qiang Peng,et al.  Personalized clothing recommendation combining user social circle and fashion style consistency , 2018, Multimedia Tools and Applications.

[52]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[53]  Christian Theobalt,et al.  Single-Shot Multi-person 3D Pose Estimation from Monocular RGB , 2017, 2018 International Conference on 3D Vision (3DV).

[54]  Yang Liu,et al.  Video2Shop: Exact Matching Clothes in Videos to Online Shopping Images , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Yang Liu,et al.  Video eCommerce++: Toward Large Scale Online Video Advertising , 2017, IEEE Transactions on Multimedia.

[56]  James J. Little,et al.  A Simple Yet Effective Baseline for 3d Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[57]  Hans-Peter Seidel,et al.  VNect , 2017, ACM Trans. Graph..

[58]  Bo Zhao,et al.  Multi-View Image Generation from a Single-View , 2017, ACM Multimedia.

[59]  Yichen Wei,et al.  Compositional Human Pose Regression , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[60]  Ehsan Jahangiri,et al.  Generating Multiple Diverse Hypotheses for Human 3D Pose Consistent with 2D Joint Detections , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[61]  Pascal Fua,et al.  Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision , 2016, 2017 International Conference on 3D Vision (3DV).

[62]  Yang Liu,et al.  Video eCommerce: Towards Online Video Advertising , 2016, ACM Multimedia.

[63]  Wei Zhang,et al.  Deep Kinematic Pose Regression , 2016, ECCV Workshops.

[64]  Cordelia Schmid,et al.  MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild , 2016, NIPS.

[65]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[66]  Antoni B. Chan,et al.  3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network , 2014, ACCV.

[67]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[68]  Aaron C. Courville,et al.  Generative Adversarial Networks , 2014, 1406.2661.

[69]  W. K. Hastings,et al.  Monte Carlo Sampling Methods Using Markov Chains and Their Applications , 1970 .

[70]  Qing Li,et al.  VIREO @ TRECVID 2017: Video-to-Text, Ad-hoc Video Search, and Video hyperlinking , 2017, TRECVID.