ArtTrack: Articulated Multi-Person Tracking in the Wild

In this paper we propose an approach for articulated tracking of multiple people in unconstrained videos. Our starting point is a model that resembles existing architectures for single-frame pose estimation but is substantially faster. We achieve this in two ways: (1) by simplifying and sparsifying the body-part relationship graph and leveraging recent methods for faster inference, and (2) by offloading a substantial share of computation onto a feed-forward convolutional architecture that is able to detect and associate body joints of the same person even in clutter. We use this model to generate proposals for body joint locations and formulate articulated tracking as spatio-temporal grouping of such proposals. This allows to jointly solve the association problem for all people in the scene by propagating evidence from strong detections through time and enforcing constraints that each proposal can be assigned to one person only. We report results on a public MPII Human Pose benchmark and on a new MPII Video Pose dataset of image sequences with multiple people. We demonstrate that our model achieves state-of-the-art results while using only a fraction of time and is able to leverage temporal information to improve state-of-the-art for crowded scenes.

[1]  Juergen Gall,et al.  PoseTrack: Joint Multi-person Pose Estimation and Tracking , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Bernt Schiele,et al.  Multi-person Tracking by Multicut and Deep Matching , 2016, ECCV Workshops.

[3]  Georgios Tzimiropoulos,et al.  Human Pose Estimation via Convolutional Part Heatmap Regression , 2016, ECCV.

[4]  Andrew Zisserman,et al.  Personalizing Human Video Pose Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Vittorio Ferrari,et al.  We Are Family: Joint Pose Estimation of Multiple Persons , 2010, ECCV.

[6]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Juergen Gall,et al.  Multi-person Pose Estimation with Local Joint-to-Person Associations , 2016, ECCV Workshops.

[8]  Bernt Schiele,et al.  Subgraph decomposition for multi-target tracking , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[10]  Thomas Brox,et al.  Joint Graph Decomposition & Node Labeling: Problem, Algorithms, Applications , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Andrew Zisserman,et al.  Flowing ConvNets for Human Pose Estimation in Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[12]  Cordelia Schmid,et al.  DeepFlow: Large Displacement Optical Flow with Deep Matching , 2013, 2013 IEEE International Conference on Computer Vision.

[13]  Bernt Schiele,et al.  DeeperCut: A Deeper, Stronger, and Faster Multi-person Pose Estimation Model , 2016, ECCV.

[14]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Jonathan Tompson,et al.  Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation , 2014, NIPS.

[16]  Stefan Roth,et al.  People-tracking-by-detection and people-detection-by-tracking , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Ben Taskar,et al.  Learning Adaptive Value of Information for Structured Prediction , 2013, NIPS.

[18]  Ben Taskar,et al.  Parsing human motion with stretchable models , 2011, CVPR 2011.

[19]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[21]  Peter V. Gehler,et al.  DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Silvio Savarese,et al.  Breaking the Chain: Liberation from the Temporal Markov Assumption for Tracking Human Poses , 2013, 2013 IEEE International Conference on Computer Vision.

[23]  Navdeep Jaitly,et al.  Chained Predictions Using Convolutional Neural Networks , 2016, ECCV.

[24]  Cordelia Schmid,et al.  Mixing Body-Part Sequences for Human Pose Estimation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Alan L. Yuille,et al.  Parsing occluded people by flexible compositions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Bernt Schiele,et al.  Monocular 3D pose estimation and tracking by detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[27]  Thomas Brox,et al.  Efficient Decomposition of Image and Mesh Graphs by Lifted Multicuts , 2015, ICCV.

[28]  Rainer Stiefelhagen,et al.  Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics , 2008, EURASIP J. Image Video Process..

[29]  Jonathan Tompson,et al.  Efficient ConvNet-based marker-less motion capture in general scenes with a low number of cameras , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).