HUMAN4D: A Human-Centric Multimodal Dataset for Motions and Immersive Media

We introduce HUMAN4D, a large multimodal 4D dataset that contains a variety of human activities captured simultaneously by a professional marker-based MoCap system, a volumetric capture system and an audio recording system. By capturing 2 female and 2 male professional actors performing various full-body movements and expressions, HUMAN4D provides a diverse set of motions and poses encountered in single- and multi-person daily, physical and social activities (jumping, dancing, etc.), along with multi-RGBD (mRGBD), volumetric and audio data. Although multi-view color datasets captured with hardware (HW) synchronization exist, to the best of our knowledge HUMAN4D is the first and only public resource that provides volumetric depth maps with high synchronization precision, achieved through intra- and inter-sensor HW synchronization (HW-SYNC). Moreover, a spatio-temporally aligned, scanned and rigged 3D character complements HUMAN4D to enable joint research on time-varying and high-quality dynamic meshes. We provide evaluation baselines by benchmarking HUMAN4D with state-of-the-art human pose estimation and 3D compression methods. For 2D pose estimation, OpenPose and AlphaPose reach 70.02% and 82.95% mAP (PCKh-0.5) on single-person sequences and 68.48% and 73.94% mAP (PCKh-0.5) on two-person sequences, respectively. For 3D pose estimation, a recent multi-view approach, Learnable Triangulation, achieves 80.26% mAP (PCK3D-10cm). For 3D compression, we benchmark the open-source 3D codecs Draco, Corto and CWIPC under online-encoding constraints and at steady bit-rates of 7–155 Mbps and 2–90 Mbps for mesh- and point-based volumetric video, respectively. A qualitative and quantitative visual comparison between mesh-based volumetric data reconstructed at different qualities and the captured RGB showcases the available options for 4D representations. HUMAN4D is introduced to enable joint research on spatio-temporally aligned pose, volumetric, mRGBD and audio data cues. The dataset and its code are available online.
