HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video

We introduce a free-viewpoint rendering method – HumanNeRF – that works on a given monocular video of a human performing complex body motions, e.g. a video from YouTube. Our method enables pausing the video at any frame and rendering the subject from arbitrary new camera viewpoints or even a full 360-degree camera path for that particular frame and body pose. This task is particularly challenging, as it requires synthesizing photorealistic details of the body, as seen from various camera angles that may not exist in the input video, as well as synthesizing fine details such as cloth folds and facial appearance. Our method optimizes for a volumetric representation of the person in a canonical T-pose, in concert with a motion field that maps the estimated canonical representation to every frame of the video via backward warps. The motion field is decomposed into skeletal rigid and non-rigid motions, produced by deep networks. We show significant performance improvements over prior work, and compelling examples of free-viewpoint renderings from monocular video of moving humans in challenging uncontrolled capture scenarios. 1e.g., https://youtu.be/0ORaAnJYROg

[1]  Ramesh Raskar,et al.  Image-based visual hulls , 2000, SIGGRAPH.

[2]  Andreas Geiger,et al.  GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis , 2020, NeurIPS.

[3]  Atsushi Nakazawa,et al.  Human video textures , 2009, I3D '09.

[4]  Kai Zhang,et al.  NeRF++: Analyzing and Improving Neural Radiance Fields , 2020, ArXiv.

[5]  Michael J. Black,et al.  SCANimate: Weakly Supervised Learning of Skinned Clothed Avatar Networks , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Christian Theobalt,et al.  Style and Pose Control for Image Synthesis of Humans from a Single Monocular View , 2021, ArXiv.

[7]  Ersin Yumer,et al.  S3: Neural Shape, Skeleton, and Skinning Fields for 3D Human Modeling , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Christian Theobalt,et al.  Real-time deep dynamic characters , 2021, ACM Transactions on Graphics.

[9]  Jonathan T. Barron,et al.  Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains , 2020, NeurIPS.

[10]  Jonathan T. Barron,et al.  NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Noah Snavely,et al.  Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Pratul P. Srinivasan,et al.  NeRF , 2020, ECCV.

[13]  Francesc Moreno-Noguer,et al.  D-NeRF: Neural Radiance Fields for Dynamic Scenes , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Hao Li,et al.  ARCH: Animatable Reconstruction of Clothed Humans , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Johannes Kopf,et al.  Dynamic View Synthesis from Dynamic Monocular Video , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[16]  Ira Kemelmacher-Shlizerman,et al.  Vid2Actor: Free-viewpoint Animatable Person Synthesis from Video in the Wild , 2020, ArXiv.

[17]  Harry Shum,et al.  Review of image-based rendering techniques , 2000, Visual Communications and Image Processing.

[18]  Michael J. Black,et al.  Learning to Reconstruct 3D Human Pose and Shape via Model-Fitting in the Loop , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[19]  Changil Kim,et al.  Space-time Neural Irradiance Fields for Free-Viewpoint Video , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Takeo Kanade,et al.  Virtualized Reality: Constructing Virtual Worlds from Real Scenes , 1997, IEEE Multim..

[21]  Peter Hedman,et al.  Instant 3D photography , 2018, ACM Trans. Graph..

[22]  Michael J. Black,et al.  LEAP: Learning Articulated Occupancy of People , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Michael J. Black,et al.  SNARF: Differentiable Forward Skinning for Animating Non-Rigid Neural Implicit Shapes , 2021, IEEE International Conference on Computer Vision.

[24]  Richard Szeliski,et al.  Computer Vision - Algorithms and Applications , 2011, Texts in Computer Science.

[25]  Alexei A. Efros,et al.  The Unreasonable Effectiveness of Deep Features as a Perceptual Metric , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Lance Williams,et al.  View Interpolation for Image Synthesis , 1993, SIGGRAPH.

[27]  Adrian Hilton,et al.  4D video textures for interactive character appearance , 2014, Comput. Graph. Forum.

[28]  Jonathan T. Barron,et al.  NeRV: Neural Reflectance and Visibility Fields for Relighting and View Synthesis , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Nelson L. Max,et al.  Optical Models for Direct Volume Rendering , 1995, IEEE Trans. Vis. Comput. Graph..

[30]  Bharat Lal Bhatnagar,et al.  LoopReg: Self-supervised Learning of Implicit Surface Correspondences, Pose and Shape for 3D Human Mesh Registration , 2020, NeurIPS.

[31]  Steven M. Seitz,et al.  LookinGood , 2018, ACM Trans. Graph..

[32]  Andrea Tagliasacchi,et al.  Volumetric Capture of Humans With a Single RGBD Camera via Semi-Parametric Learning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Richard Szeliski,et al.  High-quality video view interpolation using a layered representation , 2004, SIGGRAPH 2004.

[34]  Jonathan T. Barron,et al.  IBRNet: Learning Multi-View Image-Based Rendering , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Paul Debevec,et al.  NeRFactor , 2021, ACM Trans. Graph..

[36]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[37]  Hujun Bao,et al.  Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Jan-Michael Frahm,et al.  Pixelwise View Selection for Unstructured Multi-View Stereo , 2016, ECCV.

[39]  Hujun Bao,et al.  Animatable Neural Radiance Fields for Modeling Dynamic Human Bodies , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[40]  Frédo Durand,et al.  Synthesizing Images of Humans in Unseen Poses , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[41]  Michael J. Black,et al.  Learning Realistic Human Reposing using Cyclic Self-Supervision with 3D Shape, Pose, and Appearance Consistency , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[42]  Jonathan T. Barron,et al.  Baking Neural Radiance Fields for Real-Time View Synthesis , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[43]  George Drettakis,et al.  Depth synthesis and local warps for plausible image-based navigation , 2013, TOGS.

[44]  Christian Theobalt,et al.  Neural actor , 2021, ACM Trans. Graph..

[45]  M. Zollhöfer,et al.  Learning Dynamic Textures for Neural Rendering of Human Actors , 2020, IEEE Transactions on Visualization and Computer Graphics.

[46]  Jonathan T. Barron,et al.  HyperNeRF , 2021, ACM Trans. Graph..

[47]  Hans-Peter Seidel,et al.  Free-viewpoint video of human actors , 2003, ACM Trans. Graph..

[48]  Stephen Lin,et al.  Neural Articulated Radiance Field , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[49]  Alvaro Collet,et al.  High-quality streamable free-viewpoint video , 2015, ACM Trans. Graph..

[50]  Andreas Geiger,et al.  GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  George Drettakis,et al.  Scalable inside-out image-based rendering , 2016, ACM Trans. Graph..

[52]  Hans-Peter Seidel,et al.  Video-based characters: creating new human performances from a multi-view video database , 2011, ACM Trans. Graph..

[53]  Luc Van Gool,et al.  Pose Guided Person Image Generation , 2017, NIPS.

[54]  Qiang Hu,et al.  Multi-View Neural Human Rendering , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Wojciech Matusik,et al.  Articulated mesh animation from multi-view silhouettes , 2008, ACM Trans. Graph..

[56]  Jonathan T. Barron,et al.  Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields , 2021, ArXiv.

[57]  Tony Tung,et al.  Neural-GIF: Neural Generalized Implicit Functions for Animating People in Clothing , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[58]  Jonathan T. Barron,et al.  Nerfies: Deformable Neural Radiance Fields , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[59]  Christian Theobalt,et al.  Non-Rigid Neural Radiance Fields: Reconstruction and Novel View Synthesis of a Dynamic Scene From Monocular Video , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[60]  Hongyi Xu,et al.  H-NeRF: Neural Radiance Fields for Rendering and Temporal Reconstruction of Humans in Motion , 2021, NeurIPS.

[61]  Helge Rhodin,et al.  A-NeRF: Articulated Neural Radiance Fields for Learning Human Shape, Appearance, and Pose , 2021, NeurIPS.

[62]  Hans-Peter Seidel,et al.  Performance capture from sparse multi-view video , 2008, ACM Trans. Graph..

[63]  Iasonas Kokkinos,et al.  Dense Pose Transfer , 2018, ECCV.

[64]  Marcus A. Magnor,et al.  Video Based Reconstruction of 3D People Models , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[65]  Daniel Cohen-Or,et al.  SAPE: Spatially-Adaptive Progressive Encoding for Neural Optimization , 2021 .

[66]  Jingyi Yu,et al.  Editable free-viewpoint video using a layered neural representation , 2021, ACM Trans. Graph..

[67]  Jitendra Malik,et al.  Modeling and Rendering Architecture from Photographs: A hybrid geometry- and image-based approach , 1996, SIGGRAPH.