Paper information - UniPose: Unified Human Pose Estimation in Single Images and Videos

We propose UniPose, a unified framework for human pose estimation, based on our "Waterfall" Atrous Spatial Pooling architecture, that achieves state-of-the-art results on several pose estimation metrics. UniPose incorporates contextual segmentation and joint localization to estimate the human pose in a single stage, with high accuracy, without relying on statistical postprocessing methods. The Waterfall module in UniPose leverages the efficiency of progressive filtering in the cascade architecture, while maintaining multi-scale fields of view comparable to spatial pyramid configurations. Additionally, our method is extended to UniPose-LSTM for multi-frame processing and achieves state-of-the-art results for temporal pose estimation in video. Our results on multiple datasets demonstrate that UniPose, with a ResNet backbone and Waterfall module, is a robust and efficient architecture for pose estimation, obtaining state-of-the-art results in single-person pose detection for both single images and videos.
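The contrast the abstract draws between a cascade with progressive filtering and a spatial pyramid can be illustrated with simple field-of-view arithmetic. The sketch below is not the authors' implementation: the function names, the 3x3 kernel size, and the dilation rates 6, 12, 18, 24 are assumptions chosen for illustration. It compares the one-axis field of view of parallel atrous branches (each reading the input directly) with waterfall-style branches (each reading the previous branch's output, with every intermediate output kept).

```python
def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Span, in pixels along one axis, covered by a dilated (atrous) kernel."""
    return dilation * (kernel_size - 1) + 1


def parallel_fields_of_view(rates, kernel_size=3):
    """Spatial-pyramid style: every branch filters the same input directly."""
    return [effective_kernel(kernel_size, r) for r in rates]


def waterfall_fields_of_view(rates, kernel_size=3):
    """Waterfall style: each branch filters the previous branch's output.

    With stride 1, stacking a layer grows the field of view by
    (effective kernel size - 1). Each intermediate value is recorded
    because each branch output is kept as a separate scale.
    """
    fov = 1
    fields = []
    for r in rates:
        fov += effective_kernel(kernel_size, r) - 1
        fields.append(fov)
    return fields


# Assumed dilation rates, for illustration only.
RATES = [6, 12, 18, 24]

print(parallel_fields_of_view(RATES))   # [13, 25, 37, 49]
print(waterfall_fields_of_view(RATES))  # [13, 37, 73, 121]
```

Under these assumptions, the cascaded branches still yield one distinct scale per branch, and each is at least as wide as its parallel counterpart, which is consistent with the abstract's claim of multi-scale fields of view comparable to spatial pyramid configurations.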

Andreas Savakis | Bruno Artacho
