Weakly-supervised Disentangling with Recurrent Transformations for 3D View Synthesis

An important problem for both graphics and vision is to synthesize novel views of a 3D object from a single image. This is particularly challenging due to the partial observability inherent in projecting a 3D object onto the image space, and the ill-posedness of inferring object shape and pose. However, we can train a neural network to address the problem if we restrict our attention to specific object categories (in our case faces and chairs) for which we can gather ample training data. In this paper, we propose a novel recurrent convolutional encoder-decoder network that is trained end-to-end on the task of rendering rotated objects starting from a single image. The recurrent structure allows our model to capture long-term dependencies along a sequence of transformations. We demonstrate the quality of its predictions for human faces on the Multi-PIE dataset and for a dataset of 3D chair models, and also show its ability to disentangle latent factors of variation (e.g., identity and pose) without using full supervision.

[1]  R. Shepard,et al.  Mental Rotation of Three-Dimensional Objects , 1971, Science.

[2]  Matthew Turk,et al.  A Morphable Model For The Synthesis Of 3D Faces , 1999, SIGGRAPH.

[3]  Joshua B. Tenenbaum,et al.  Separating Style and Content with Bilinear Models , 2000, Neural Computation.

[4]  Takeo Kanade,et al.  Multi-PIE , 2008, 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition.

[5]  Jason Weston,et al.  Curriculum learning , 2009, ICML '09.

[6]  Geoffrey E. Hinton,et al.  Transforming Auto-Encoders , 2011, ICANN.

[7]  Sven J. Dickinson,et al.  3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model , 2012, NIPS.

[8]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[9]  Alex Graves,et al.  Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.

[10]  Wojciech Zaremba,et al.  Learning to Execute , 2014, ArXiv.

[11]  Graham W. Taylor,et al.  "Mental Rotation" by Optimizing Transforming Distance , 2014, ArXiv.

[12]  Roland Memisevic,et al.  Modeling Deep Temporal Dependencies with Recurrent "Grammar Cells" , 2014, NIPS.

[13]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[14]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[15]  Alexei A. Efros,et al.  Seeing 3D Chairs: Exemplar Part-Based 2D-3D Alignment Using a Large Dataset of CAD Models , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Yuting Zhang,et al.  Learning to Disentangle Factors of Variation with Manifold Interaction , 2014, ICML.

[17]  Xiaogang Wang,et al.  Multi-View Perceptron: a Deep Model for Learning Face Identity and View Representations , 2014, NIPS.

[18]  Tijmen Tieleman,et al.  Optimizing Neural Networks that Generate Iimages , 2014 .

[19]  Yaser Sheikh,et al.  3D object manipulation in a single photograph using stock 3D models , 2014, ACM Trans. Graph..

[20]  Joshua B. Tenenbaum,et al.  Deep Convolutional Inverse Graphics Network , 2015, NIPS.

[21]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[22]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[23]  Thomas Brox,et al.  Learning to generate chairs with convolutional neural networks , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Mathieu Aubry,et al.  Understanding Deep Features with Computer-Generated Imagery , 2015, ICCV.

[25]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Honglak Lee,et al.  Action-Conditional Video Prediction using Deep Networks in Atari Games , 2015, NIPS.

[27]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[28]  Xiangyu Zhu,et al.  High-fidelity Pose and Expression Normalization for face recognition in the wild , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Bruno A. Olshausen,et al.  Discovering Hidden Factors of Variation in Deep Networks , 2014, ICLR.

[30]  John Flynn,et al.  Deep Stereo: Learning to Predict New Views from the World's Imagery , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.