LightTrack: A Generic Framework for Online Top-Down Human Pose Tracking

In this paper, we propose a simple yet effective framework, named LightTrack, for online human pose tracking. Existing methods usually perform human detection, pose estimation and tracking in sequential stages, where pose tracking is regarded as an offline bipartite matching problem. Our proposed framework is designed to be generic, efficient and truly online for top-down approaches. For efficiency, Single-Person Pose Tracking (SPT) and Visual Object Tracking (VOT) are incorporated as a unified on-line functioning entity, easily implemented by a replaceable single-person pose estimator. To mitigate offline optimization costs, the framework also unifies SPT with online identity association and sheds first light upon bridging multi-person keypoint tracking with Multi-Target Object Tracking (MOT). Specifically, we propose a Siamese Graph Convolution Network (SGCN) for human pose matching as a Re-ID module. In contrary to other Re-ID modules, we use a graphical representation of human joints for matching. The skeleton-based representation effectively captures human pose similarity and is computationally inexpensive. It is robust to sudden camera shifts that introduce human drifting. The proposed framework is general enough to fit other pose estimators and candidate matching mechanisms. Extensive experiments show that our method outperforms other online methods and is very competitive with offline state-of-the-art methods while maintaining higher frame rates. Code and models are publicly available at https://github.com/Guanghan/lighttrack.

[1]  Cewu Lu,et al.  RMPE: Regional Multi-person Pose Estimation , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[2]  Juergen Gall,et al.  Joint Flow: Temporal Flow Fields for Multi Person Tracking , 2018, ArXiv.

[3]  Haoyu Wang,et al.  Pose Flow: Efficient Online Pose Tracking , 2018, BMVC.

[4]  Yi Li,et al.  Deformable Convolutional Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[5]  Honggang Qi,et al.  Multi-Scale Structure-Aware Network for Human Pose Estimation , 2018, ECCV.

[6]  Xingyi Zhou,et al.  Bottom-Up Object Detection by Grouping Extreme and Center Points , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Wanli Ouyang,et al.  Towards Multi-Person Pose Tracking : Bottom-up and Top-down Methods , 2017 .

[10]  Jonathan Tompson,et al.  Towards Accurate Multi-person Pose Estimation in the Wild , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Xiaogang Wang,et al.  Learning Feature Pyramids for Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[12]  Hei Law,et al.  CornerNet: Detecting Objects as Paired Keypoints , 2018, ECCV.

[13]  Yaser Sheikh,et al.  Efficient Online Multi-Person 2D Pose Tracking With Recurrent Spatio-Temporal Affinity Fields , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Kyoung Mu Lee,et al.  PoseFix: Model-Agnostic General Human Pose Refinement Network , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Peng Wang,et al.  Joint Multi-person Pose Estimation and Semantic Part Segmentation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Gang Yu,et al.  Rethinking on Multi-Stage Networks for Human Pose Estimation , 2019, ArXiv.

[17]  Yichen Wei,et al.  Simple Baselines for Human Pose Estimation and Tracking , 2018, ECCV.

[18]  Hua Yang,et al.  Online Multi-Object Tracking with Dual Matching Attention Networks , 2018, ECCV.

[19]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[20]  Rainer Stiefelhagen,et al.  Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics , 2008, EURASIP J. Image Video Process..

[21]  Bernt Schiele,et al.  PoseTrack: A Benchmark for Human Pose Estimation and Tracking , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Dong Liu,et al.  Deep High-Resolution Representation Learning for Human Pose Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[25]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[26]  Zhiao Huang,et al.  Associative Embedding: End-to-End Learning for Joint Detection and Grouping , 2016, NIPS.

[27]  Juergen Gall,et al.  PoseTrack: Joint Multi-person Pose Estimation and Tracking , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Lorenzo Torresani,et al.  Detect-and-Track: Efficient Pose Estimation in Videos , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  Mark Everingham,et al.  Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation , 2010, BMVC.

[30]  Gang Yu,et al.  Cascaded Pyramid Network for Multi-person Pose Estimation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Bernt Schiele,et al.  Learning to Refine Human Pose Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[32]  Cewu Lu,et al.  PoseHD: Boosting Human Detectors Using Human Pose Information , 2018, AAAI.

[33]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Luc Van Gool,et al.  Deep Extreme Cut: From Extreme Points to Object Segmentation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[35]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[37]  Hao Zhu,et al.  CrowdPose: Efficient Crowded Scenes Pose Estimation and a New Benchmark , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Yi Li,et al.  R-FCN: Object Detection via Region-based Fully Convolutional Networks , 2016, NIPS.

[39]  Dahua Lin,et al.  Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, AAAI.