Paper Information - Neural Free-Viewpoint Performance Rendering under Complex Human-object Interactions

4D reconstruction of human-object interaction is critical for immersive VR/AR experiences and human activity understanding. Recent advances still fail to recover fine geometry and texture from sparse RGB inputs, especially in challenging human-object interaction scenarios. In this paper, we propose a neural human performance capture and rendering system that generates high-quality geometry and photo-realistic texture for both the human and the objects under challenging interaction scenarios, in arbitrary novel views and from only sparse RGB streams. To deal with the complex occlusions caused by human-object interactions, we adopt a layer-wise scene decoupling strategy and perform volumetric reconstruction and neural rendering of the human and the object separately. Specifically, for geometry reconstruction, we propose an interaction-aware human-object capture scheme that jointly considers human reconstruction and object reconstruction together with their correlations. Occlusion-aware human reconstruction and robust human-aware object tracking are proposed for consistent 4D human-object dynamic reconstruction. For neural texture rendering, we propose a layer-wise human-object rendering scheme that combines direction-aware neural blending weight learning with spatial-temporal texture completion to provide high-resolution, photo-realistic texture results in occluded scenarios. Extensive experiments demonstrate the effectiveness of our approach in achieving high-quality geometry and texture reconstruction in free viewpoints for challenging human-object interactions.
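To make the idea of direction-aware blending more concrete, the following is a minimal sketch and not the method of the paper: the abstract describes blending weights that are learned by a network, whereas this example swaps in a hand-crafted heuristic (a softmax over the cosine similarity between the novel view direction and each source view direction, masked by visibility). The function name `blend_colors`, the `sharpness` parameter, and the per-point visibility mask are all assumptions introduced for illustration.

```python
import numpy as np


def blend_colors(target_dir, source_dirs, source_colors, visible, sharpness=10.0):
    """Blend per-view colors of one surface point with direction-based weights.

    Hypothetical heuristic stand-in for learned blending weights:
    views whose direction is closer to the novel view get larger weights,
    and views in which the point is occluded get zero weight.
    """
    target_dir = np.asarray(target_dir, dtype=float)
    source_dirs = np.asarray(source_dirs, dtype=float)
    source_colors = np.asarray(source_colors, dtype=float)
    visible = np.asarray(visible, dtype=bool)

    # No source view sees this point: nothing to blend
    # (a completion step would have to fill such a pixel).
    if not visible.any():
        return None, np.zeros(len(source_dirs))

    # Normalize directions so the dot product is a cosine similarity.
    target_dir = target_dir / np.linalg.norm(target_dir)
    source_dirs = source_dirs / np.linalg.norm(source_dirs, axis=1, keepdims=True)
    cos_sim = source_dirs @ target_dir

    # Masked softmax: occluded views receive -inf logits, hence zero weight.
    logits = np.where(visible, sharpness * cos_sim, -np.inf)
    logits = logits - logits[visible].max()  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()

    return weights @ source_colors, weights


if __name__ == "__main__":
    color, weights = blend_colors(
        target_dir=[0.0, 0.0, 1.0],
        source_dirs=[[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
        source_colors=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
        visible=[True, True, False],
    )
    print("weights:", weights)
    print("blended color:", color)
```

In this sketch the third view is occluded and contributes nothing, while the first view, which is aligned with the novel view, dominates the blend. When no source view observes the point, the function returns `None`, which is the kind of gap a texture completion step would need to fill.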
