Temporal Feature Correlation for Human Pose Estimation in Videos

Effective use of temporal information is critical for human pose estimation in videos. Recent methods either neglect the displacements of keypoints across video frames or rely on time-consuming optical flow estimation when fusing temporal information. By contrast, we propose a flow-free, displacement-aware algorithm for pose estimation in videos. Our method is based on the observation that the appearance of body keypoints remains almost unchanged throughout a video. This motivates us to exploit the temporal visual consistency of keypoints via temporal feature correlation, which establishes sparse correspondences between the keypoints in neighboring frames. Specifically, we first extract keypoint features from the previous frame; these serve as exemplars for a search over the intermediate feature map of the current frame. We then perform temporal feature correlation for the keypoint search and combine the resulting correlation maps with the convolutional features to guide heatmap estimation. Extensive experiments demonstrate that the proposed method compares favorably against state-of-the-art approaches on both the sub-JHMDB and Penn Action datasets. More importantly, our method is robust to large keypoint displacements and can be applied to videos with fast motion.
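
The three steps described above (exemplar extraction, temporal feature correlation, and fusion with convolutional features) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the use of single-pixel (1x1) exemplars, cosine normalization as the similarity measure, and channel-wise concatenation as the fusion step are all assumptions made for illustration.

```python
import numpy as np


def extract_keypoint_features(prev_feat, keypoints):
    """Sample one exemplar vector per keypoint from the previous frame.

    prev_feat: (C, H, W) feature map of the previous frame.
    keypoints: (K, 2) integer (row, col) locations on that feature map.
    Returns a (K, C) array. Using 1x1 exemplars is an assumption.
    """
    rows, cols = keypoints[:, 0], keypoints[:, 1]
    return prev_feat[:, rows, cols].T


def temporal_feature_correlation(exemplars, cur_feat, eps=1e-8):
    """Correlate each exemplar with every location of the current frame.

    exemplars: (K, C); cur_feat: (C, H, W).
    Returns (K, H, W) correlation maps. Cosine similarity is an assumed
    choice of similarity measure.
    """
    c, h, w = cur_feat.shape
    flat = cur_feat.reshape(c, h * w)
    flat = flat / (np.linalg.norm(flat, axis=0, keepdims=True) + eps)
    ex = exemplars / (np.linalg.norm(exemplars, axis=1, keepdims=True) + eps)
    return (ex @ flat).reshape(-1, h, w)


def fuse_with_features(corr_maps, cur_feat):
    """Combine correlation maps with convolutional features.

    Channel-wise concatenation is an assumed fusion; a heatmap head
    (not shown) would consume the (C + K, H, W) result.
    """
    return np.concatenate([cur_feat, corr_maps], axis=0)


# Toy example: the current frame is the previous frame shifted by (3, 5),
# which stands in for a large keypoint displacement.
rng = np.random.default_rng(0)
prev_feat = rng.standard_normal((16, 12, 12))
cur_feat = np.roll(prev_feat, shift=(3, 5), axis=(1, 2))
keypoints = np.array([[2, 3], [7, 1]])

exemplars = extract_keypoint_features(prev_feat, keypoints)
corr_maps = temporal_feature_correlation(exemplars, cur_feat)
fused = fuse_with_features(corr_maps, cur_feat)

# Location of the correlation peak for each keypoint in the current frame.
peaks = [
    tuple(int(v) for v in np.unravel_index(np.argmax(m), m.shape))
    for m in corr_maps
]
```

In this toy setting the correlation peak follows each keypoint to its displaced location without any optical flow computation, which is the property the abstract refers to as flow-free and displacement-aware.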
