Unsupervised Learning of Probably Symmetric Deformable 3D Objects From Images in the Wild

We propose a method to learn 3D deformable object categories from raw single-view images, without external supervision. The method is based on an autoencoder that factors each input image into depth, albedo, viewpoint and illumination. In order to disentangle these components without supervision, we use the fact that many object categories have, at least in principle, a symmetric structure. We show that reasoning about illumination allows us to exploit the underlying object symmetry even if the appearance is not symmetric due to shading. Furthermore, we model objects that are probably, but not certainly, symmetric by predicting a symmetry probability map, learned end-to-end with the other components of the model. Our experiments show that this method can recover very accurately the 3D shape of human faces, cat faces and cars from single-view images, without any supervision or a prior shape model. On benchmarks, we demonstrate superior accuracy compared to another method that uses supervision at the level of 2D image correspondences.

[1]  Michael J. Black,et al.  Generating 3D faces using Convolutional Mesh Autoencoders , 2018, ECCV.

[2]  Xiaogang Wang,et al.  Deep Learning Face Attributes in the Wild , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[3]  Jitendra Malik,et al.  Learning to See by Moving , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[4]  Michael J. Black,et al.  OpenDR: An Approximate Differentiable Renderer , 2014, ECCV.

[5]  Jaakko Lehtinen,et al.  Learning to Predict 3D Objects with an Interpolation-based Differentiable Renderer , 2019, NeurIPS.

[6]  Jiajun Wu,et al.  Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling , 2016, NIPS.

[7]  Michael J. Black,et al.  Learning to Regress 3D Face Shape and Expression From an Image Without 3D Supervision , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Richard Szeliski,et al.  Detecting and Reconstructing 3D Mirror Symmetric Objects , 2012, ECCV.

[9]  Subhransu Maji,et al.  3D Shape Induction from 2D Views of Multiple Objects , 2016, 2017 International Conference on 3D Vision (3DV).

[10]  Tatsuya Harada,et al.  Learning View Priors for Single-View 3D Reconstruction , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[12]  Graham W. Taylor,et al.  Adaptive deconvolutional networks for mid and high level feature learning , 2011, 2011 International Conference on Computer Vision.

[13]  Andrea Vedaldi,et al.  Unsupervised learning of object frames by dense equivariant image labelling , 2017, NIPS.

[14]  David J. Kriegman,et al.  The Bas-Relief Ambiguity , 2004, International Journal of Computer Vision.

[15]  Ping-Sing Tsai,et al.  Shape from Shading: A Survey , 1999, IEEE Trans. Pattern Anal. Mach. Intell..

[16]  Jonathan Tompson,et al.  Discovery of Latent 3D Keypoints via End-to-end Geometric Reasoning , 2018, NeurIPS.

[17]  Patrick Pérez,et al.  MoFA: Model-Based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[18]  Paolo Favaro,et al.  Unsupervised Generative 3D Shape Learning from Natural Images , 2019, ArXiv.

[19]  Sergey Tulyakov,et al.  3D Guided Fine-Grained Face Manipulation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Sebastian Thrun,et al.  Shape from symmetry , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[21]  Sina Honari,et al.  Unsupervised Depth Estimation, 3D Face Rotation and Replacement , 2018, NeurIPS.

[22]  Henning Biermann,et al.  Recovering non-rigid 3D shape from image streams , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[23]  Alex Kendall,et al.  What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? , 2017, NIPS.

[24]  Andrea Vedaldi,et al.  Learning 3D Object Categories by Looking Around Them , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[25]  Yuan Gao,et al.  Exploiting Symmetry and/or Manhattan Properties for 3D Object Structure Estimation from Single and Multiple Images , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Carlos D. Castillo,et al.  SfSNet: Learning Shape, Reflectance and Illuminance of Faces 'in the Wild' , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27]  Matthias Bethge,et al.  ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness , 2018, ICLR.

[28]  Tatsuya Harada,et al.  Neural 3D Mesh Renderer , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  C. V. Jawahar,et al.  Cats and dogs , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Tobias Ritschel,et al.  Escaping Plato's Cave using Adversarial Training: 3D Shape From Unstructured 2D Image Collections , 2018, ArXiv.

[31]  Lijun Yin,et al.  A high-resolution 3D dynamic facial expression database , 2008, 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition.

[32]  Berthold K. P. Horn Obtaining shape from shading information , 1989 .

[33]  Joon Son Chung,et al.  VoxCeleb2: Deep Speaker Recognition , 2018, INTERSPEECH.

[34]  Bernhard Egger,et al.  Morphable Face Models - An Open Framework , 2017, 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018).

[35]  Yuri Odagiri,et al.  Unsupervised Adversarial Learning of 3D Human Pose from 2D Joint Locations , 2018, ArXiv.

[36]  Iasonas Kokkinos,et al.  Lifting AutoEncoders: Unsupervised Learning of a Fully-Disentangled 3D Morphable Model Using Deep Non-Rigid Structure From Motion , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[37]  Vincent Dumoulin,et al.  Deconvolution and Checkerboard Artifacts , 2016 .

[38]  J J Koenderink,et al.  What Does the Occluding Contour Tell Us about Solid Shape? , 1984, Perception.

[39]  Andrew L. Maas Rectifier Nonlinearities Improve Neural Network Acoustic Models , 2013 .

[40]  Liang Lin,et al.  Single View Stereo Matching , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[41]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.

[42]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Takeo Kanade,et al.  Dense 3D face alignment from 2D videos in real-time , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[44]  Andrew Zisserman,et al.  Shape from symmetry: detecting and exploiting symmetry in affine images , 1995, Philosophical Transactions of the Royal Society of London. Series A: Physical and Engineering Sciences.

[45]  James M. Rehg,et al.  Unsupervised 3D Pose Estimation With Geometric Self-Supervision , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Vittorio Ferrari,et al.  Learning Single-Image 3D Reconstruction by Generative Modelling of Shape, Pose and Shading , 2019, International Journal of Computer Vision.

[47]  Jiajun Wu,et al.  Visual Object Networks: Image Generation with Disentangled 3D Representations , 2018, NeurIPS.

[48]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[49]  Oisin Mac Aodha,et al.  Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Stefanos Zafeiriou,et al.  GANFIT: Generative Adversarial Network Fitting for High Fidelity 3D Face Reconstruction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Weiwei Zhang,et al.  Cat Head Detection - How to Effectively Exploit Shape and Texture Features , 2008, ECCV.

[52]  Andrew Zisserman,et al.  Face Painting: querying art with photos , 2015, BMVC.

[53]  Olivier D. Faugeras,et al.  Shape From Shading , 2006, Handbook of Mathematical Models in Computer Vision.

[54]  Andrea Vedaldi,et al.  Modelling and unsupervised learning of symmetric deformable object categories , 2018, NeurIPS.

[55]  Mengjiao Wang,et al.  An Adversarial Neuro-Tensorial Approach for Learning Disentangled Representations , 2017, International Journal of Computer Vision.

[56]  Sami Romdhani,et al.  A 3D Face Model for Pose and Illumination Invariant Face Recognition , 2009, 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance.

[57]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[58]  A. U.S.,et al.  Recovering Surface Shape and Orientation from Texture , 2002 .

[59]  Iasonas Kokkinos,et al.  Deforming Autoencoders: Unsupervised Disentangling of Shape and Appearance , 2018, ECCV.

[60]  Shaun J. Canavan,et al.  BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial expression database , 2014, Image Vis. Comput..

[61]  O. Faugeras,et al.  The Geometry of Multiple Images , 1999 .

[62]  Katerina Fragkiadaki,et al.  Adversarial Inverse Graphics Networks: Learning 2D-to-3D Lifting and Image-to-Image Translation from Unpaired Supervision , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[63]  Thomas Brox,et al.  DeMoN: Depth and Motion Network for Learning Monocular Stereo , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  David J. Kriegman,et al.  The Bas-Relief Ambiguity , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[65]  Krista A. Ehinger,et al.  SUN database: Large-scale scene recognition from abbey to zoo , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[66]  Kaiming He,et al.  Group Normalization , 2018, ECCV.

[67]  Yong-Liang Yang,et al.  HoloGAN: Unsupervised Learning of 3D Representations From Natural Images , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[68]  Leonidas J. Guibas,et al.  Learning Representations and Generative Models for 3D Point Clouds , 2017, ICML.

[69]  Simon Lucey,et al.  Learning Depth from Monocular Videos Using Direct Methods , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[70]  Gérard G. Medioni,et al.  Mirror symmetry => 2-view stereo geometry , 2003, Image Vis. Comput..

[71]  Andrea Vedaldi,et al.  C3DPO: Canonical 3D Pose Networks for Non-Rigid Structure From Motion , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[72]  Jitendra Malik,et al.  Learning Category-Specific Mesh Reconstruction from Image Collections , 2018, ECCV.

[73]  Nikos Komodakis,et al.  Unsupervised Representation Learning by Predicting Image Rotations , 2018, ICLR.

[74]  Hao Li,et al.  Soft Rasterizer: A Differentiable Renderer for Image-Based 3D Reasoning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[75]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.