Unsupervised Learning of 3D Object Categories from Videos in the Wild

Our goal is to learn a deep network that, given a small number of images of an object of a given category, reconstructs it in 3D. While several recent works have obtained analogous results using synthetic data or assuming the avail-ability of 2D primitives such as keypoints, we are interested in working with challenging real data and with no manual annotations. We thus focus on learning a model from multiple views of a large collection of object instances. We contribute with a new large dataset of object centric videos suitable for training and benchmarking this class of models. We show that existing techniques leveraging meshes, voxels, or implicit surfaces, which work well for reconstructing isolated objects, fail on this challenging data. Finally, we propose a new neural network design, called warp-conditioned ray embedding (WCR), which significantly improves reconstruction while obtaining a detailed implicit representation of the object surface and texture, also compensating for the noise in the initial SfM reconstruction that bootstrapped the learning process. Our evaluation demonstrates performance improvements over several deep monocular reconstruction baselines on existing benchmarks and on our novel dataset. For additional material please visit: https://henzler.github.io/publication/unsupervised_videos/.

[1]  Pratul P. Srinivasan,et al.  IBRNet: Learning Multi-View Image-Based Rendering , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Angjoo Kanazawa,et al.  pixelNeRF: Neural Radiance Fields from One or Few Images , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  A. Torralba,et al.  Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering , 2020, ICLR.

[4]  Jonathan T. Barron,et al.  NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Kyaw Zaw Lin,et al.  Neural Sparse Voxel Fields , 2020, NeurIPS.

[6]  Jitendra Malik,et al.  Shape and Viewpoint without Keypoints , 2020, ECCV.

[7]  Andreas Geiger,et al.  GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis , 2020, NeurIPS.

[8]  Abhinav Gupta,et al.  Articulation-Aware Canonical Surface Mapping , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Hanbyul Joo,et al.  PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Ronen Basri,et al.  Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance , 2020, NeurIPS.

[11]  Pratul P. Srinivasan,et al.  NeRF , 2020, ECCV.

[12]  Jan Kautz,et al.  Self-supervised Single-view 3D Reconstruction via Semantic Consistency , 2020, ECCV.

[13]  Andrea Vedaldi,et al.  Capturing the Geometry of Object Categories from Video Supervision , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Andreas Geiger,et al.  Differentiable Volumetric Rendering: Learning Implicit 3D Representations Without 3D Supervision , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Thomas Funkhouser,et al.  Local Deep Implicit Functions for 3D Shape , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  A. Vedaldi,et al.  Unsupervised Learning of Probably Symmetric Deformable 3D Objects From Images in the Wild , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Y. Lipman,et al.  SAL: Sign Agnostic Learning of Shapes From Raw Data , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Wan-Yen Lo,et al.  Accelerating 3D deep learning with PyTorch3D , 2019, SIGGRAPH Asia 2020 Courses.

[19]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[20]  Andreas Geiger,et al.  Occupancy Flow: 4D Reconstruction by Learning Particle Dynamics , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  S. Fidler,et al.  Learning to Predict 3D Objects with an Interpolation-based Differentiable Renderer , 2019, NeurIPS.

[22]  Shubham Tulsiani,et al.  Canonical Surface Mapping via Geometric Cycle Consistency , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Ming-Yu Liu,et al.  PointFlow: 3D Point Cloud Generation With Continuous Normalizing Flows , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[24]  Jitendra Malik,et al.  Mesh R-CNN , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[25]  Hao Li,et al.  PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[26]  Pavlo Molchanov,et al.  SCOPS: Self-Supervised Co-Part Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Thomas A. Funkhouser,et al.  Learning Shape Templates With Structured Implicit Functions , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[28]  Hao Li,et al.  Soft Rasterizer: A Differentiable Renderer for Image-Based 3D Reasoning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[29]  Yong-Liang Yang,et al.  HoloGAN: Unsupervised Learning of 3D Representations From Natural Images , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[30]  Richard A. Newcombe,et al.  DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Timo Aila,et al.  A Style-Based Generator Architecture for Generative Adversarial Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Sebastian Nowozin,et al.  Occupancy Networks: Learning 3D Reconstruction in Function Space , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Jan-Michael Frahm,et al.  Deep blending for free-viewpoint image-based rendering , 2018, ACM Trans. Graph..

[34]  Tobias Ritschel,et al.  Escaping Plato's Cave using Adversarial Training: 3D Shape From Unstructured 2D Image Collections , 2018, ArXiv.

[35]  Alexey Dosovitskiy,et al.  Unsupervised Learning of Shape and Pose with Differentiable Point Clouds , 2018, NeurIPS.

[36]  Chongyang Ma,et al.  Deep Volumetric Video From Very Sparse Multi-view Performance Capture , 2018, ECCV.

[37]  Wei Liu,et al.  Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images , 2018, ECCV.

[38]  Jitendra Malik,et al.  Learning Category-Specific Mesh Reconstruction from Image Collections , 2018, ECCV.

[39]  Jitendra Malik,et al.  Multi-view Consistency as Supervisory Signal for Learning Shape and Pose Prediction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40]  Tatsuya Harada,et al.  Neural 3D Mesh Renderer , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[41]  Jitendra Malik,et al.  Learning a Multi-View Stereo Machine , 2017, NIPS.

[42]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[43]  Andrea Vedaldi,et al.  Learning 3D Object Categories by Looking Around Them , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[44]  Alexei A. Efros,et al.  Multi-view Supervision for Single-View Reconstruction via Differentiable Ray Consistency , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Subhransu Maji,et al.  3D Shape Induction from 2D Views of Multiple Objects , 2016, 2017 International Conference on 3D Vision (3DV).

[46]  Hao Su,et al.  A Point Set Generation Network for 3D Object Reconstruction from a Single Image , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Honglak Lee,et al.  Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision , 2016, NIPS.

[48]  Jan-Michael Frahm,et al.  Pixelwise View Selection for Unstructured Multi-View Stereo , 2016, ECCV.

[49]  Andrea Vedaldi,et al.  Learning the semantic structure of objects from Web supervision , 2016, ArXiv.

[50]  Max Jaderberg,et al.  Unsupervised Learning of 3D Structure from Images , 2016, NIPS.

[51]  Jan-Michael Frahm,et al.  Structure-from-Motion Revisited , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Silvio Savarese,et al.  3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction , 2016, ECCV.

[53]  Abhinav Gupta,et al.  Learning a Predictable and Generative Vector Representation for Objects , 2016, ECCV.

[54]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.

[56]  Thomas Brox,et al.  Unsupervised Generation of a View Point Annotated Car Dataset from Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[57]  Jitendra Malik,et al.  Virtual view networks for object reconstruction , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[59]  Lourdes Agapito,et al.  Reconstructing PASCAL VOC , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[60]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[61]  Andrew W. Fitzgibbon,et al.  What Shape Are Dolphins? Building 3D Morphable Models from 2D Images , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[62]  Pietro Perona,et al.  The Caltech-UCSD Birds-200-2011 Dataset , 2011 .

[63]  Michael Bosse,et al.  Unstructured lumigraph rendering , 2001, SIGGRAPH.

[64]  Bernhard P. Wrobel,et al.  Multiple View Geometry in Computer Vision , 2001 .

[65]  Nelson L. Max,et al.  Optical Models for Direct Volume Rendering , 1995, IEEE Trans. Vis. Comput. Graph..