POINet: Pose-Guided Ovonic Insight Network for Multi-Person Pose Tracking

Multi-person pose tracking aims to jointly estimate and track the keypoints of multiple people in unconstrained videos. The most popular solution follows the tracking-by-detection strategy, which relies on human detection and data association. While deep learning has greatly improved human detection, existing works mainly realize data association through several separate stages with hand-crafted metrics, leading to high uncertainty and poor adaptation in complex scenes. To address these problems, we propose an end-to-end pose-guided ovonic insight network (POINet) for data association in multi-person pose tracking, which jointly learns feature extraction, similarity estimation, and identity assignment. Specifically, we design a pose-guided representation network that integrates pose information into hierarchical convolutional features, generating a pose-aligned representation for each person, which helps handle partial occlusions. Moreover, we propose an ovonic insight network that adaptively encodes the cross-frame identity transformation and can cope with the difficult tracking cases of people leaving and entering the scene. Overall, the proposed POINet offers a new way to realize multi-person pose tracking in an end-to-end fashion. Extensive experiments on the PoseTrack benchmark demonstrate that POINet outperforms state-of-the-art methods.
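
To make the identity-assignment step concrete, the following is a minimal sketch of cross-frame association that also handles people leaving and entering the scene. It is not the POINet architecture: the similarity scores are assumed to be given (in POINet they would be learned), and the function name `associate`, the `dummy_score` parameter, and the greedy conflict resolution are hypothetical choices made only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def associate(similarity, track_ids, num_detections, next_id, dummy_score=0.0):
    """Assign identities to current-frame detections (illustrative only).

    similarity[i][j] is an assumed, precomputed score between previous-frame
    track i and current-frame detection j. Each row is extended with a dummy
    column (score `dummy_score`, a hypothetical parameter) that stands for
    "this person left the scene".
    """
    assigned = {}    # detection index -> identity
    left = []        # identities that found no detection
    candidates = []  # (probability, track index, detection index)

    for i, row in enumerate(similarity):
        probs = softmax(list(row) + [dummy_score])
        best = max(range(len(probs)), key=probs.__getitem__)
        if best == num_detections:
            left.append(track_ids[i])  # dummy column won: person left
        else:
            candidates.append((probs[best], i, best))

    # Resolve conflicts greedily: the most confident match claims the detection.
    for _, i, j in sorted(candidates, reverse=True):
        if j in assigned:
            left.append(track_ids[i])
        else:
            assigned[j] = track_ids[i]

    # Detections that no track claimed are treated as people entering the scene.
    entered = []
    for j in range(num_detections):
        if j not in assigned:
            assigned[j] = next_id
            entered.append(next_id)
            next_id += 1

    return assigned, left, entered, next_id


# Two existing tracks (ids 7 and 8) and two detections in the new frame.
scores = [[5.0, -2.0],
          [-3.0, -4.0]]
result = associate(scores, track_ids=[7, 8], num_detections=2, next_id=9)
print(result)
```

In this example track 7 keeps its identity on detection 0, track 8 matches nothing and is marked as having left, and detection 1 receives the fresh identity 9. An end-to-end model would replace both the fixed scores and the greedy rule with learned components.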
