Refined Temporal Pyramidal Compression-and-Amplification Transformer for 3D Human Pose Estimation

Estimating the 3D pose of humans in video sequences demands both high accuracy and a well-structured architecture. Building on the success of transformers, we introduce the Refined Temporal Pyramidal Compression-and-Amplification (RTPCA) transformer. Exploiting the temporal dimension, RTPCA extends intra-block temporal modeling via its Temporal Pyramidal Compression-and-Amplification (TPCA) structure and refines inter-block feature interaction with a Cross-Layer Refinement (XLR) module. In particular, the TPCA block exploits a temporal pyramid paradigm, reinforcing the representation capabilities of keys and values and seamlessly extracting spatial semantics from motion sequences. We stitch these TPCA blocks together with XLR, which promotes rich semantic representation through continuous interaction of queries, keys, and values. This strategy integrates early-stage information with current flows, addressing the deficits in detail and stability typically seen in other transformer-based methods. We demonstrate the effectiveness of RTPCA by achieving state-of-the-art results on the Human3.6M, HumanEva-I, and MPI-INF-3DHP benchmarks with minimal computational overhead. The source code is available at https://github.com/hbing-l/RTPCA.
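As a rough illustration of the two ideas named above, the following NumPy sketch shows one possible reading of them; it is not the RTPCA implementation. It assumes that temporal compression can be approximated by average-pooling keys and values along the frame axis, that the pyramid can be expressed as a schedule of pooling factors that first grows and then shrinks, and that cross-layer refinement can be approximated by letting current features attend to features from an earlier block. All function names, the factor schedule (1, 2, 4, 2, 1), and the absence of learned projections are assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention for (T_q, C) queries and (T_k, C) keys/values.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def compress(x, factor):
    # Average-pool along time: (T, C) -> (T // factor, C).
    # Assumption: pooling stands in for whatever learned compression is used.
    t, c = x.shape
    t_trim = (t // factor) * factor
    return x[:t_trim].reshape(t // factor, factor, c).mean(axis=1)

def pyramid_block(x, factor):
    # Queries keep full temporal resolution; keys and values are pooled.
    kv = compress(x, factor)
    return x + attention(x, kv, kv)

def cross_layer_refine(current, earlier):
    # Hypothetical stand-in for cross-layer interaction: queries from the
    # current block, keys and values from an earlier block.
    return current + attention(current, earlier, earlier)

def run_sketch(frames, factors=(1, 2, 4, 2, 1)):
    # The key/value length shrinks and then grows again across blocks.
    x = frames
    previous = frames
    for factor in factors:
        out = pyramid_block(x, factor)
        x = cross_layer_refine(out, previous)
        previous = out
    return x

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 16))  # 8 frames, 16 feature channels
refined = run_sketch(frames)
print(refined.shape)
```

The output keeps the input's temporal length because only keys and values are pooled, which is also why compressing them reduces the cost of the attention product.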
