Visual Object Networks: Image Generation with Disentangled 3D Representations

Recent progress in deep generative models has led to tremendous breakthroughs in image generation. While existing models can synthesize photorealistic images, they lack an understanding of our underlying 3D world. Unlike previous work built on 2D datasets and models, we present a new generative model, Visual Object Networks (VON), that synthesizes natural images of objects with a disentangled 3D representation. Inspired by classic graphics rendering pipelines, we unravel the image formation process into three conditionally independent factors---shape, viewpoint, and texture---and present an end-to-end adversarial learning framework that jointly models 3D shape and 2D texture. Our model first learns to synthesize 3D shapes that are indistinguishable from real shapes. It then renders the object's 2.5D sketches (i.e., its silhouette and depth map) from the shape under a sampled viewpoint. Finally, it learns to add realistic texture to these 2.5D sketches to generate natural images. VON not only generates images that are more realistic than state-of-the-art 2D image synthesis methods but also enables many 3D operations, such as changing the viewpoint of a generated image, editing shape and texture, interpolating linearly in shape and texture space, and transferring appearance across different objects and viewpoints.
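To make the shape-viewpoint-texture factorization concrete, here is a minimal PyTorch-style sketch of the three-stage pipeline described above: a shape generator produces a voxel grid from a shape code, a projection step turns it into a 2.5D sketch (silhouette and depth) under a sampled viewpoint, and a texture network maps the sketch plus a texture code to an RGB image. All module architectures, tensor shapes, and the toy max-projection "renderer" are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ShapeGenerator(nn.Module):
    """Maps a shape code z_shape to a coarse voxel occupancy grid (assumed 32^3)."""
    def __init__(self, z_dim=200, res=32):
        super().__init__()
        self.res = res
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, res ** 3), nn.Sigmoid(),
        )

    def forward(self, z_shape):
        v = self.net(z_shape)
        return v.view(-1, 1, self.res, self.res, self.res)  # occupancy in [0, 1]

def render_25d(voxels, viewpoint):
    """Toy projection of a voxel grid to a 2.5D sketch (silhouette + depth).

    A real pipeline would use a differentiable projection layer that rotates the
    grid by `viewpoint` and ray-marches it; here we only max-project along the
    depth axis for illustration (the depth channel uses a non-differentiable argmax).
    """
    del viewpoint                                   # rotation omitted in this sketch
    occ = voxels.squeeze(1)                         # (B, D, H, W)
    silhouette, hit = occ.max(dim=1)                # (B, H, W) each
    depth = hit.float() / occ.shape[1]              # normalized depth of first "hit"
    return torch.stack([silhouette, depth], dim=1)  # (B, 2, H, W)

class TextureNetwork(nn.Module):
    """Translates a 2.5D sketch plus a texture code into an RGB image."""
    def __init__(self, z_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 + z_dim, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, sketch, z_texture):
        # Broadcast the texture code spatially and concatenate with the sketch.
        z = z_texture[:, :, None, None].expand(-1, -1, *sketch.shape[2:])
        return self.net(torch.cat([sketch, z], dim=1))

# Sampling an image from the three conditionally independent factors.
shape_G, texture_G = ShapeGenerator(), TextureNetwork()
z_shape, z_texture = torch.randn(1, 200), torch.randn(1, 8)
viewpoint = torch.tensor([[0.3, 1.2]])  # (elevation, azimuth); unused in this sketch
image = texture_G(render_25d(shape_G(z_shape), viewpoint), z_texture)
print(image.shape)  # torch.Size([1, 3, 32, 32])
```

Because the three codes are sampled independently, the 3D manipulations mentioned above follow directly: resampling `viewpoint` re-renders the same shape from a new camera, while swapping `z_texture` transfers appearance across objects.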
