TRACE: 5D Temporal Regression of Avatars with Dynamic Cameras in 3D Environments

Although the estimation of 3D human pose and shape (HPS) is rapidly progressing, current methods still cannot reliably estimate moving humans in global coordinates, which is critical for many applications. This is particularly challenging when the camera is also moving, entangling human and camera motion. To address these issues, we adopt a novel 5D representation (space, time, and identity) that enables end-to-end reasoning about people in scenes. Our method, called TRACE, introduces several novel architectural components. Most importantly, it uses two new "maps" to reason about the 3D trajectory of people over time in camera and world coordinates. An additional memory unit enables persistent tracking of people even during long occlusions. TRACE is the first one-stage method to jointly recover and track 3D humans in global coordinates from dynamic cameras. By training it end-to-end and using full image information, TRACE achieves state-of-the-art performance on tracking and HPS benchmarks. The code and dataset are released for research purposes.
