A Point Set Generation Network for 3D Object Reconstruction from a Single Image

Generating 3D data with deep neural networks has been attracting increasing attention in the research community. The majority of existing works resort to regular representations such as volumetric grids or collections of images; however, these representations obscure the natural invariance of 3D shapes under geometric transformations and also suffer from a number of other issues. In this paper we address the problem of 3D reconstruction from a single image, generating a straightforward form of output: point cloud coordinates. This output form raises a unique and interesting issue: the groundtruth shape for an input image may be ambiguous. Driven by this unorthodox output form and the inherent ambiguity in the groundtruth, we design an architecture, a loss function, and a learning paradigm that are novel and effective. Our final solution is a conditional shape sampler capable of predicting multiple plausible 3D point clouds from an input image. In experiments, our system not only outperforms state-of-the-art methods on single-image 3D reconstruction benchmarks, but also shows strong performance on 3D shape completion and a promising ability to make multiple plausible predictions.
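Because the output is an unordered point set rather than a fixed grid, the training loss must compare point clouds directly; the paper uses set distances such as the Chamfer distance and the Earth Mover's distance for this purpose. Below is a minimal NumPy sketch of a symmetric Chamfer distance between two point sets, intended only as an illustration; the function name, array shapes, and brute-force pairwise computation are assumptions, not the authors' implementation.

    import numpy as np

    def chamfer_distance(p, q):
        # p: (N, 3) predicted points, q: (M, 3) groundtruth points.
        # For every point, find its nearest neighbour in the other set
        # and average the squared distances in both directions.
        d = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)  # (N, M) pairwise squared distances
        return d.min(axis=1).mean() + d.min(axis=0).mean()

    # Toy usage with random point sets of 1024 points each.
    pred = np.random.rand(1024, 3)
    gt = np.random.rand(1024, 3)
    print(chamfer_distance(pred, gt))

Unlike a per-voxel or per-pixel loss, this distance is invariant to the ordering of the predicted points, which is what makes a point-set output trainable in the first place.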
