Tracking People by Predicting 3D Appearance, Location & Pose

In this paper, we present an approach for tracking people in monocular videos by predicting their future 3D representations. To achieve this, we first lift people to 3D from a single frame in a robust way. This lifting includes information about the person's 3D pose, their location in 3D space, and their 3D appearance. As we track a person, we collect 3D observations over time in a tracklet representation. Given the 3D nature of our observations, we build temporal models for each of these attributes. We use these models to predict the future state of the tracklet, including 3D location, 3D appearance, and 3D pose. For a future frame, we compute the similarity between the predicted state of a tracklet and the single-frame observations in a probabilistic manner. Association is solved with simple Hungarian matching, and the matches are used to update the respective tracklets. We evaluate our approach on various benchmarks and report state-of-the-art results.
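To make the association step concrete, the following is a minimal sketch of matching predicted tracklet states to single-frame detections with Hungarian matching. It is an illustration under stated assumptions, not the implementation used in this work: the cost here is a weighted sum of squared distances over location, appearance, and pose vectors, which stands in for the probabilistic similarity described above. The names `pair_cost`, `associate`, `weights`, and `max_cost`, as well as the dictionary layout of a state, are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

State = Dict[str, Sequence[float]]  # assumed keys: "location", "appearance", "pose"


def hungarian(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Minimum-cost assignment for a rectangular cost matrix.

    Returns sorted (row, column) pairs; min(rows, columns) pairs are produced.
    """
    if not cost or not cost[0]:
        return []
    n, m = len(cost), len(cost[0])
    if n > m:
        # The routine below needs rows <= columns, so solve the transpose.
        flipped = hungarian([list(col) for col in zip(*cost)])
        return sorted((r, c) for c, r in flipped)
    inf = float("inf")
    u = [0.0] * (n + 1)  # row potentials (1-indexed)
    v = [0.0] * (m + 1)  # column potentials (1-indexed)
    p = [0] * (m + 1)    # p[j] = row currently assigned to column j, 0 if free
    way = [0] * (m + 1)  # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:  # flip assignments back along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0)


def pair_cost(pred: State, det: State, weights: Dict[str, float]) -> float:
    """Hypothetical cost: weighted squared distance per attribute."""
    total = 0.0
    for key, w in weights.items():
        total += w * sum((a - b) ** 2 for a, b in zip(pred[key], det[key]))
    return total


def associate(predicted: List[State], detections: List[State],
              weights: Dict[str, float], max_cost: float) -> List[Tuple[int, int]]:
    """Match predicted tracklet states to detections, dropping costly pairs."""
    cost = [[pair_cost(t, d, weights) for d in detections] for t in predicted]
    return [(t, d) for t, d in hungarian(cost) if cost[t][d] <= max_cost]
```

In this sketch, each returned pair `(t, d)` would be used to update tracklet `t` with detection `d`. How unmatched detections and unmatched tracklets are handled (for example, starting new tracklets or keeping a tracklet alive through occlusion) is left out and would depend on the tracker's design.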
