Share With Thy Neighbors: Single-View Reconstruction by Cross-Instance Consistency

. Approaches to single-view reconstruction typically rely on viewpoint annotations, silhouettes, the absence of background, multiple views of the same instance, a template shape, or symmetry. We avoid all of these supervisions and hypotheses by leveraging explicitly the consistency between images of different object instances. As a result, our method can learn from large collections of unlabelled images depicting the same object category. Our main contributions are two approaches to leverage cross-instance consistency: (i) progressive conditioning , a training strategy to gradually specialize the model from category to instances in a curriculum learning fashion; (ii) swap reconstruction , a loss enforcing consistency between instances having similar shape or texture. Critical to the success of our method are also: our structured autoencoding architecture decomposing an image into explicit shape, texture, pose, and background; an adapted formulation of differential rendering, and; a new optimization scheme alternating between 3D and pose learning. We compare our approach, UNICORN, both on the diverse synthetic ShapeNet dataset - the classical benchmark for methods requiring multiple views as supervision - and on standard real-image benchmarks (Pascal3D+ Car, CUB-200) for which most methods require known templates and silhouette annotations. We also showcase applicability to more challenging real-world collections (CompCars, LSUN), where silhouettes are not available and images are not cropped around the object. demonstrate high-quality and realistic textured 3D reconstructions.

[1]  Deva Ramanan,et al.  NeRS: Neural Reflectance Surfaces for Sparse-view 3D Reconstruction in the Wild , 2021, NeurIPS.

[2]  Jean Ponce,et al.  Unsupervised Layered Image Decomposition into Object Prototypes , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[3]  A. Makadia,et al.  De-rendering the World’s Revolutionary Artefacts , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Andreas Geiger,et al.  GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  A. Torralba,et al.  Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering , 2020, ICLR.

[6]  Jitendra Malik,et al.  Shape and Viewpoint without Keypoints , 2020, ECCV.

[7]  Abhinav Gupta,et al.  Implicit Mesh Reconstruction from Unannotated Image Collections , 2020, ArXiv.

[8]  Adrien Gaidon,et al.  Differentiable Rendering: A Survey , 2020, ArXiv.

[9]  Mathieu Aubry,et al.  Deep Transformation-Invariant Clustering , 2020, NeurIPS.

[10]  Christoph H. Lampert,et al.  Leveraging 2D Data to Learn Textured 3D Mesh Generation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Jan Kautz,et al.  Self-supervised Single-view 3D Reconstruction via Semantic Consistency , 2020, ECCV.

[12]  Andreas Geiger,et al.  Differentiable Volumetric Rendering: Learning Implicit 3D Representations Without 3D Supervision , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  A. Vedaldi,et al.  Unsupervised Learning of Probably Symmetric Deformable 3D Objects From Images in the Wild , 2019, Computer Vision and Pattern Recognition.

[14]  Wan-Yen Lo,et al.  Accelerating 3D deep learning with PyTorch3D , 2019, SIGGRAPH Asia 2020 Courses.

[15]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[16]  S. Fidler,et al.  Learning to Predict 3D Objects with an Interpolation-based Differentiable Renderer , 2019, NeurIPS.

[17]  Duygu Ceylan,et al.  DISN: Deep Implicit Surface Network for High-quality Single-view 3D Reconstruction , 2019, NeurIPS.

[18]  Hao Li,et al.  Soft Rasterizer: A Differentiable Renderer for Image-Based 3D Reasoning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[19]  Yong-Liang Yang,et al.  HoloGAN: Unsupervised Learning of 3D Representations From Natural Images , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[20]  Vittorio Ferrari,et al.  Learning Single-Image 3D Reconstruction by Generative Modelling of Shape, Pose and Shading , 2019, International Journal of Computer Vision.

[21]  Timo Aila,et al.  A Style-Based Generator Architecture for Generative Adversarial Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Sebastian Nowozin,et al.  Occupancy Networks: Learning 3D Reconstruction in Function Space , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Alexey Dosovitskiy,et al.  Unsupervised Learning of Shape and Pose with Differentiable Point Clouds , 2018, NeurIPS.

[24]  Wei Liu,et al.  Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images , 2018, ECCV.

[25]  Jitendra Malik,et al.  Learning Category-Specific Mesh Reconstruction from Image Collections , 2018, ECCV.

[26]  Mathieu Aubry,et al.  AtlasNet: A Papier-M\^ach\'e Approach to Learning 3D Surface Generation , 2018, CVPR 2018.

[27]  Alexei A. Efros,et al.  The Unreasonable Effectiveness of Deep Features as a Perceptual Metric , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[28]  Jitendra Malik,et al.  Multi-view Consistency as Supervisory Signal for Learning Shape and Pose Prediction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  Tatsuya Harada,et al.  Neural 3D Mesh Renderer , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Alexei A. Efros,et al.  Multi-view Supervision for Single-View Reconstruction via Differentiable Ray Consistency , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Thomas Brox,et al.  FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Honglak Lee,et al.  Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision , 2016, NIPS.

[33]  Silvio Savarese,et al.  3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction , 2016, ECCV.

[34]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.

[36]  Yinda Zhang,et al.  LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop , 2015, ArXiv.

[37]  Samy Bengio,et al.  Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[38]  Xiaoou Tang,et al.  A large-scale car dataset for fine-grained categorization and verification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Jitendra Malik,et al.  Category-specific object reconstruction from a single image , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[42]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[43]  Michael J. Black,et al.  OpenDR: An Approximate Differentiable Renderer , 2014, ECCV.

[44]  Lourdes Agapito,et al.  Reconstructing PASCAL VOC , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[45]  Silvio Savarese,et al.  Beyond PASCAL: A benchmark for 3D object detection in the wild , 2014, IEEE Winter Conference on Applications of Computer Vision.

[46]  Pietro Perona,et al.  Caltech-UCSD Birds 200 , 2010 .

[47]  Jason Weston,et al.  Curriculum learning , 2009, ICML '09.

[48]  Marc Alexa,et al.  Laplacian mesh optimization , 2006, GRAPHITE '06.

[49]  Alexei A. Efros,et al.  Putting Objects in Perspective , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[50]  Alexei A. Efros,et al.  Geometric context from a single image , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[51]  Eero P. Simoncelli,et al.  Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.

[52]  Brendan J. Frey,et al.  Learning flexible sprites in video layers , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[53]  Mark Meyer,et al.  Implicit fairing of irregular meshes using diffusion and curvature flow , 1999, SIGGRAPH.

[54]  Stanley Finger,et al.  Origins of neuroscience: A history of explorations into brain function. , 1994 .

[55]  J. Elman Learning and development in neural networks: the importance of starting small , 1993, Cognition.

[56]  Paul J. Besl,et al.  A Method for Registration of 3-D Shapes , 1992, IEEE Trans. Pattern Anal. Mach. Intell..