Paper information - Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images Using a View-Based Representation

We infer and generate three-dimensional (3D) scene information from a single input image, without supervision. This problem is under-explored: most prior work relies on supervision from, e.g., 3D ground truth, multiple images of a scene, image silhouettes, or key-points. We propose Pix2Shape, an approach to this problem with four components: (i) an encoder that infers a latent 3D representation from an image, (ii) a decoder that generates an explicit 2.5D surfel-based reconstruction of a scene from the latent code, (iii) a differentiable renderer that synthesizes a 2D image from the surfel representation, and (iv) a critic network trained to discriminate between images generated by the decoder-renderer and those from a training distribution. Pix2Shape can generate complex 3D scenes that scale with the view-dependent on-screen resolution, unlike representations that capture world-space resolution, such as voxels or meshes. We show that Pix2Shape learns a consistent scene representation in its encoded latent space, and that the decoder can then be applied to this latent representation to synthesize the scene from a novel viewpoint. We evaluate Pix2Shape with experiments on the ShapeNet dataset as well as on a novel benchmark we developed, called 3D-IQTT, to evaluate models based on their ability to enable 3D spatial reasoning. Qualitative and quantitative evaluations demonstrate Pix2Shape's ability to solve scene reconstruction, generation, and understanding tasks.
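The data flow between the four components can be illustrated with a toy sketch. This is not the authors' implementation: the random linear maps standing in for trained networks, the tensor sizes, the explicit `view` input to the decoder, the Lambertian shading with inverse-square attenuation, and all function names are assumptions made purely for illustration. A real system would train these components adversarially and use a proper differentiable surfel renderer.

```python
import numpy as np

H, W, LATENT = 8, 8, 16  # toy sizes, chosen only for illustration

rng = np.random.default_rng(0)
# Random linear maps stand in for the trained networks (hypothetical).
W_enc = rng.normal(scale=0.1, size=(LATENT, H * W))
W_dec = rng.normal(scale=0.1, size=(H * W * 4, LATENT + 3))
w_critic = rng.normal(scale=0.1, size=H * W)


def encode(image):
    """(i) Encoder: image -> latent scene code."""
    return np.tanh(W_enc @ image.ravel())


def decode(z, view):
    """(ii) Decoder: latent code + viewpoint -> one surfel per pixel.

    Each surfel is represented here as a depth and a unit normal.
    """
    out = (W_dec @ np.concatenate([z, view])).reshape(H, W, 4)
    depth = 1.0 + np.exp(out[..., 0])  # strictly greater than 1
    normals = out[..., 1:]
    normals = normals / (np.linalg.norm(normals, axis=-1, keepdims=True) + 1e-8)
    return depth, normals


def render(depth, normals, light_dir):
    """(iii) Renderer: simplified Lambertian shading of the surfels."""
    light = light_dir / np.linalg.norm(light_dir)
    shading = np.clip(normals @ light, 0.0, 1.0)
    return shading / depth ** 2  # assumed inverse-square falloff


def critic(image):
    """(iv) Critic: image -> scalar realism score."""
    return float(w_critic @ image.ravel())


image = rng.random((H, W))
z = encode(image)
# Re-decoding the same latent code from another viewpoint mimics novel-view synthesis.
depth, normals = decode(z, view=np.array([0.3, 0.0, 1.0]))
rendered = render(depth, normals, light_dir=np.array([0.0, 0.0, 1.0]))
score = critic(rendered)
```

In this sketch the image resolution fixes the number of surfels, which mirrors the point that the representation scales with on-screen resolution and not with a world-space grid.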
