BANMo: Building Animatable 3D Neural Models from Many Casual Videos

Prior work for articulated 3D shape reconstruction often relies on specialized multi-view and depth sensors or pre-built deformable 3D models. Such methods do not scale to diverse sets of objects in the wild. We present a method that requires neither of them. It aims to create high-fidelity, articulated 3D models from many casual RGB videos in a differentiable rendering framework. Our key in-sight is to merge three schools of thought: (1) classic deformable shape models that make use of articulated bones and blend skinning, (2) canonical embeddings that establish correspondences between pixels and a canonical 3D model, and (3) volumetric neural radiance fields (NeRFs) that are amenable to gradient-based optimization. We introduce neural blend skinning models that allow for differentiable and invertible articulated deformations. When combined with canonical embeddings, such models allow us to establish dense correspondences across videos that can be self-supervised with cycle consistency. On real and synthetic datasets, our method shows higher-fidelity 3D reconstructions than prior works for humans and animals, with the ability to render realistic images from novel viewpoints. Project page: https://banmo-www.github.io/.

[1]  Yaser Sheikh,et al.  Spatiotemporal Bundle Adjustment for Dynamic 3D Human Reconstruction in the Wild , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Deva Ramanan,et al.  NeRS: Neural Reflectance Surfaces for Sparse-view 3D Reconstruction in the Wild , 2021, NeurIPS.

[3]  Minsu Cho,et al.  Self-Calibrating Neural Radiance Fields , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[4]  Jia Deng,et al.  DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras , 2021, NeurIPS.

[5]  A. Vedaldi,et al.  DOVE: Learning Deformable 3D Objects by Watching Videos , 2021, International Journal of Computer Vision.

[6]  Jonathan T. Barron,et al.  HyperNeRF , 2021, ACM Trans. Graph..

[7]  Yaron Lipman,et al.  Volume Rendering of Neural Implicit Surfaces , 2021, NeurIPS.

[8]  Taku Komura,et al.  NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction , 2021, ArXiv.

[9]  Iasonas Kokkinos,et al.  To The Point: Correspondence-driven monocular 3D category reconstruction , 2021, NeurIPS.

[10]  Christian Theobalt,et al.  Neural actor , 2021, ACM Trans. Graph..

[11]  Andrea Vedaldi,et al.  Discovering Relationships between Object Categories via Universal Canonical Maps , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Yasamin Jafarian,et al.  Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Simon Lucey,et al.  Neural Trajectory Fields for Dynamic Novel View Synthesis , 2021, ArXiv.

[14]  Varun Jampani,et al.  LASR: Learning Articulated Shape Reconstruction from a Monocular Video , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Hujun Bao,et al.  Animatable Neural Radiance Fields for Modeling Dynamic Human Bodies , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[16]  Antonio Torralba,et al.  BARF: Bundle-Adjusting Neural Radiance Fields , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[17]  Michael J. Black,et al.  SNARF: Differentiable Forward Skinning for Animating Non-Rigid Neural Implicit Shapes , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[18]  Stephen Lin,et al.  Neural Articulated Radiance Field , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[19]  Michael J. Black,et al.  SCANimate: Weakly Supervised Learning of Skinned Clothed Avatar Networks , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Hao Su,et al.  GNeRF: GAN-based Neural Radiance Field without Posed Camera , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Edgar Sucar,et al.  iMAP: Implicit Mapping and Positioning in Real-Time , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  V. Prisacariu,et al.  NeRF-: Neural Radiance Fields Without Known Camera Parameters , 2021, ArXiv.

[23]  Shubham Tulsiani,et al.  Shelf-Supervised Mesh Prediction in the Wild , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Helge Rhodin,et al.  A-NeRF: Articulated Neural Radiance Fields for Learning Human Shape, Appearance, and Pose , 2021, NeurIPS.

[25]  Hujun Bao,et al.  Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  M. Zollhöfer,et al.  Non-Rigid Neural Radiance Fields: Reconstruction and Novel View Synthesis of a Dynamic Scene From Monocular Video , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[27]  Francesc Moreno-Noguer,et al.  D-NeRF: Neural Radiance Fields for Dynamic Scenes , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Zhengqi Li,et al.  Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Jonathan T. Barron,et al.  Nerfies: Deformable Neural Radiance Fields , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[30]  Jonathan T. Barron,et al.  NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  D. Ramanan,et al.  ViSER: Video-Specific Surface Embeddings for Articulated 3D Shape Reconstruction , 2021, NeurIPS.

[32]  Xueting Li,et al.  Online Adaptation for Consistent Mesh Reconstruction in the Wild , 2020, NeurIPS.

[33]  Andrea Vedaldi,et al.  Continuous Surface Embeddings , 2020, NeurIPS.

[34]  Kostas Daniilidis,et al.  3D Bird Reconstruction: a Dataset, Model, and Shape Recovery from a Single View , 2020, ECCV.

[35]  C. Stoll,et al.  TexMesh: Reconstructing Detailed Human Texture and Geometry from RGB-D Video , 2020, ECCV.

[36]  Roberto Cipolla,et al.  Who Left the Dogs Out? 3D Animal Reconstruction with Expectation Maximization in the Loop , 2020, ECCV.

[37]  Abhinav Gupta,et al.  Implicit Mesh Reconstruction from Unannotated Image Collections , 2020, ArXiv.

[38]  Jonathan T. Barron,et al.  Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains , 2020, NeurIPS.

[39]  Abhinav Gupta,et al.  Articulation-Aware Canonical Surface Mapping , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Ronen Basri,et al.  Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance , 2020, NeurIPS.

[41]  Pratul P. Srinivasan,et al.  NeRF , 2020, ECCV.

[42]  Jan Kautz,et al.  Self-supervised Single-view 3D Reconstruction via Semantic Consistency , 2020, ECCV.

[43]  Ross B. Girshick,et al.  PointRend: Image Segmentation As Rendering , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Michael J. Black,et al.  VIBE: Video Inference for Human Body Pose and Shape Estimation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Suryansh Kumar,et al.  Non-Rigid Structure from Motion: Prior-Free Factorization Method Revisited , 2019, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[46]  Christian Theobalt,et al.  Neural Dense Non-Rigid Structure from Motion with Latent Space Constraints , 2020, ECCV.

[47]  Song-Chun Zhu,et al.  DenseRaC: Joint 3D Pose and Shape Estimation by Dense Render-and-Compare , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[48]  Michael J. Black,et al.  Three-D Safari: Learning to Estimate Zebra Pose, Shape, and Texture From Images “In the Wild” , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[49]  Simon Lucey,et al.  Deep Non-Rigid Structure From Motion , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[50]  Thomas Brox,et al.  What Do Single-View 3D Reconstruction Networks Learn? , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Dimitrios Tzionas,et al.  Expressive Body Capture: 3D Hands, Face, and Body From a Single Image , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Yaser Sheikh,et al.  Monocular Total Capture: Posing Face, Body, and Hands in the Wild , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Alain Trouvé,et al.  Interpolating between Optimal Transport and MMD using Sinkhorn Divergences , 2018, AISTATS.

[54]  David Picard,et al.  Human Pose Regression by Combining Indirect Part Detection and Contextual Information , 2017, Comput. Graph..

[55]  Deva Ramanan,et al.  Volumetric Correspondence Networks for Optical Flow , 2019, NeurIPS.

[56]  Michael J. Black,et al.  Lions and Tigers and Bears: Capturing Non-rigid, 3D, Articulated Shape from Images , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[57]  Jitendra Malik,et al.  Learning Category-Specific Mesh Reconstruction from Image Collections , 2018, ECCV.

[58]  Yong Jae Lee,et al.  Hide-and-Seek: Forcing a Network to be Meticulous for Weakly-Supervised Object and Action Localization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[59]  Alex Kendall,et al.  End-to-End Learning of Geometry and Context for Deep Stereo Regression , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[60]  Michael J. Black,et al.  3D Menagerie: Modeling the 3D Shape and Pose of Animals , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Jan-Michael Frahm,et al.  Pixelwise View Selection for Unstructured Multi-View Stereo , 2016, ECCV.

[62]  Jan-Michael Frahm,et al.  Structure-from-Motion Revisited , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63]  Alec Jacobson,et al.  Skinning: real-time shape deformation , 2014, SIGGRAPH ASIA Courses.

[64]  Aleix M. Martínez,et al.  Non-rigid structure from motion with complementary rank-3 spaces , 2011, CVPR 2011.

[65]  Kurt Keutzer,et al.  Dense Point Trajectories by GPU-Accelerated Large Displacement Optical Flow , 2010, ECCV.

[66]  Richard Szeliski,et al.  Building Rome in a day , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[67]  Wojciech Matusik,et al.  Articulated mesh animation from multi-view silhouettes , 2008, ACM Trans. Graph..

[68]  Richard Szeliski,et al.  Modeling the World from Internet Photo Collections , 2008, International Journal of Computer Vision.

[69]  Steven M. Seitz,et al.  Photo tourism: exploring photo collections in 3D , 2006, ACM Trans. Graph..

[70]  Seth J. Teller,et al.  Particle Video: Long-Range Motion Estimation Using Point Trajectories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[71]  Henning Biermann,et al.  Recovering non-rigid 3D shape from image streams , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).