Weakly Supervised 3D Multi-person Pose Estimation for Large-scale Scenes based on Monocular Camera and Single LiDAR

Depth estimation is usually ill-posed and ambiguous for monocular camera-based 3D multi-person pose estimation. Since LiDAR can capture accurate depth information in long-range scenes, it can benefit both the global localization of individuals and the 3D pose estimation by providing rich geometry features. Motivated by this, we propose a monocular camera and single LiDAR-based method for 3D multi-person pose estimation in large-scale scenes, which is easy to deploy and insensitive to light. Specifically, we design an effective fusion strategy to take advantage of multi-modal input data, including images and point cloud, and make full use of temporal information to guide the network to learn natural and coherent human motions. Without relying on any 3D pose annotations, our method exploits the inherent geometry constraints of point cloud for self-supervision and utilizes 2D keypoints on images for weak supervision. Extensive experiments on public datasets and our newly collected dataset demonstrate the superiority and generalization capability of our proposed method.

[1]  Lan Xu,et al.  LiDAR-aid Inertial Poser: Large-scale Human Motion Capture by Sparse Inertial and LiDAR Sensors , 2022, IEEE Transactions on Visualization and Computer Graphics.

[2]  Lan Xu,et al.  LiCamGait: Gait Recognition in the Wild by Using LiDAR and Camera Multi-modal Visual Sensors , 2022, ArXiv.

[3]  Lan Xu,et al.  Mutual Adaptive Reasoning for Monocular 3D Multi-Person Pose Estimation , 2022, ACM Multimedia.

[4]  Huizi Mao,et al.  BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation , 2022, 2023 IEEE International Conference on Robotics and Automation (ICRA).

[5]  Ruigang Yang,et al.  STCrowd: A Multimodal Dataset for Pedestrian Perception in Crowded Scenes , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Lan Xu,et al.  LiDARCap: Long-range Markerless 3D Human Motion Capture with LiDAR Point Clouds , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Chiew-Lan Tai,et al.  TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Lan Xu,et al.  HybridCap: Inertia-aid Monocular Capture of Challenging Human Motions , 2022, AAAI.

[9]  Junsong Yuan,et al.  MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Visesh Chari,et al.  Multi-modal 3D Human Pose Estimation with 2D Weak Supervision in Autonomous Driving , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[11]  Michael J. Black,et al.  Putting People in their Place: Monocular Regression of 3D People in Depth , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Xinge Zhu,et al.  Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR-Based Perception , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Shuicheng Yan,et al.  Direct Multi-view Multi-person 3D Pose Estimation , 2021, NeurIPS.

[14]  Bingbing Ni,et al.  Skeleton2Mesh: Kinematics Prior Injected Unsupervised Human Mesh Recovery , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[15]  Xu Zhao,et al.  Rgb-D Fusion For Point-Cloud-Based 3d Human Pose Estimation , 2021, 2021 IEEE International Conference on Image Processing (ICIP).

[16]  Michael S. Ryoo,et al.  4D-Net for Learned Multi-Modal Alignment , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[17]  Xiaokang Yang,et al.  PointAugmenting: Cross-Modal Augmentation for 3D Object Detection , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Andreas Geiger,et al.  Multi-Modal Fusion Transformer for End-to-End Autonomous Driving , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Xinge Zhu,et al.  Input-Output Balanced Framework for Long-Tailed Lidar Semantic Segmentation , 2021, 2021 IEEE International Conference on Multimedia and Expo (ICME).

[20]  Zhengming Ding,et al.  3D Human Pose Estimation with Spatial and Temporal Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Cewu Lu,et al.  HybrIK: A Hybrid Analytical-Neural Inverse Kinematics Solution for 3D Human Pose and Shape Estimation , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Bodo Rosenhahn,et al.  CanonPose: Self-Supervised Monocular 3D Human Pose Estimation in the Wild , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Didier Stricker,et al.  HPERL: 3D Human Pose Estimation from RGB and LiDAR , 2020, 2020 25th International Conference on Pattern Recognition (ICPR).

[24]  Michael J. Black,et al.  Monocular, One-stage, Regression of Multiple 3D People , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[25]  Philipp Krähenbühl,et al.  Center-based 3D Object Detection and Tracking , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Yang Zhao,et al.  Deep High-Resolution Representation Learning for Visual Recognition , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Hujun Bao,et al.  SMAP: Single-Shot Multi-Person Absolute 3D Pose Estimation , 2020, ECCV.

[28]  C. Qian,et al.  HMOR: Hierarchical Multi-Person Ordinal Relations for Monocular Multi-Person 3D Pose Estimation , 2020, ECCV.

[29]  Xinge Zhu,et al.  SSN: Shape Signature Networks for Multi-class Object Detection from Point Clouds , 2020, ECCV.

[30]  Yu Cheng,et al.  3D Human Pose Estimation using Spatio-Temporal Networks with Explicit Occlusion Training , 2020, AAAI.

[31]  Alex H. Lang,et al.  PointPainting: Sequential Fusion for 3D Object Detection , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Cordelia Schmid,et al.  LCR-Net++: Multi-Person 2D and 3D Pose Detection in Natural Images , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Pascal Fua,et al.  XNect , 2019, ACM Trans. Graph..

[34]  Kyoung Mu Lee,et al.  Camera Distance-Aware Top-Down Approach for 3D Multi-Person Pose Estimation From a Single RGB Image , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[35]  Andrew Zisserman,et al.  Exploiting Temporal Context for 3D Human Pose Estimation in the Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  András Lörincz,et al.  Absolute Human Pose Estimation with Depth Prediction Network , 2019, 2019 International Joint Conference on Neural Networks (IJCNN).

[37]  James M. Rehg,et al.  Unsupervised 3D Pose Estimation With Geometric Self-Supervision , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Emre Akbas,et al.  Self-Supervised Learning of 3D Human Pose Using Multi-View Geometry , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Hujun Bao,et al.  Fast and Robust Multi-Person 3D Pose Estimation From Multiple Views , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Bin Yang,et al.  Deep Continuous Fusion for Multi-sensor 3D Object Detection , 2018, ECCV.

[41]  Bodo Rosenhahn,et al.  Supplementary Material to: Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera , 2018 .

[42]  Pascal Fua,et al.  Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation , 2018, ECCV.

[43]  Wolfram Burgard,et al.  3D Human Pose Estimation in RGBD Images for Robotic Task Learning , 2018, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[44]  Christian Theobalt,et al.  Single-Shot Multi-person 3D Pose Estimation from Monocular RGB , 2017, 2018 International Conference on 3D Vision (3DV).

[45]  Steven Lake Waslander,et al.  Joint 3D Proposal Generation and Object Detection from View Aggregation , 2017, 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[46]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[47]  James J. Little,et al.  A Simple Yet Effective Baseline for 3d Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[48]  Hans-Peter Seidel,et al.  VNect , 2017, ACM Trans. Graph..

[49]  Lourdes Agapito,et al.  Lifting from the Deep: Convolutional 3D Pose Estimation from a Single Image , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Leonidas J. Guibas,et al.  PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Hao Su,et al.  A Point Set Generation Network for 3D Object Reconstruction from a Single Image , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Pascal Fua,et al.  Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision , 2016, 2017 International Conference on 3D Vision (3DV).

[53]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54]  Xiaowei Zhou,et al.  Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Ji Wan,et al.  Multi-view 3D Object Detection Network for Autonomous Driving , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Michael J. Black,et al.  SMPL: A Skinned Multi-Person Linear Model , 2023 .

[57]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[58]  J. Little,et al.  Inverse perspective mapping simplifies optical flow computation and obstacle detection , 2004, Biological Cybernetics.