Factoring Shape, Pose, and Layout from the 2D Image of a 3D Scene

The goal of this paper is to take a single 2D image of a scene and recover the 3D structure in terms of a small set of factors: a layout representing the enclosing surfaces as well as a set of objects represented in terms of shape and pose. We propose a convolutional neural network-based approach to predict this representation and benchmark it on a large dataset of indoor scenes. Our experiments evaluate a number of practical design questions, demonstrate that we can infer this representation, and quantitatively and qualitatively demonstrate its merits compared to alternate representations.

[1]  Leonidas J. Guibas,et al.  Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[2]  Alexei A. Efros,et al.  Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics , 2010, ECCV.

[3]  Antonio Torralba,et al.  Parsing IKEA Objects: Fine Pose Estimation , 2013, 2013 IEEE International Conference on Computer Vision.

[4]  Jitendra Malik,et al.  Viewpoints and keypoints , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Xiaowei Zhou,et al.  6-DoF object pose from semantic keypoints , 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[6]  Abhinav Gupta,et al.  Learning a Predictable and Generative Vector Representation for Objects , 2016, ECCV.

[7]  Abhinav Gupta,et al.  Building Part-Based Object Detectors via 3D Geometry , 2013, 2013 IEEE International Conference on Computer Vision.

[8]  Abhinav Gupta,et al.  Marr Revisited: 2D-3D Alignment via Surface Normal Prediction , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Stefan Leutenegger,et al.  SceneNet RGB-D: Can 5M Synthetic Images Beat Generic ImageNet Pre-training on Indoor Segmentation? , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[10]  Jiajun Wu,et al.  Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling , 2016, NIPS.

[11]  Alexei A. Efros,et al.  Geometric context from a single image , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[12]  Silvio Savarese,et al.  3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction , 2016, ECCV.

[13]  Jitendra Malik,et al.  Aligning 3D models to RGB-D images of cluttered scenes , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Martial Hebert,et al.  Data-Driven 3D Primitives for Single Image Understanding , 2013, 2013 IEEE International Conference on Computer Vision.

[15]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[16]  Matthias Nießner,et al.  ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  C. Lawrence Zitnick,et al.  Edge Boxes: Locating Object Proposals from Edges , 2014, ECCV.

[18]  Thomas A. Funkhouser,et al.  Semantic Scene Completion from a Single Depth Image , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Derek Hoiem,et al.  Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[20]  Silvio Savarese,et al.  Beyond PASCAL: A benchmark for 3D object detection in the wild , 2014, IEEE Winter Conference on Applications of Computer Vision.

[21]  Takeo Kanade,et al.  Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces , 2010, NIPS.

[22]  Ersin Yumer,et al.  Physically-Based Rendering for Indoor Scene Understanding Using Convolutional Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Lawrence G. Roberts,et al.  Machine Perception of Three-Dimensional Solids , 1963, Outstanding Dissertations in the Computer Sciences.

[24]  Derek Hoiem,et al.  Recovering the spatial layout of cluttered rooms , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[25]  Thomas Brox,et al.  A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Leonidas J. Guibas,et al.  Joint embeddings of shapes and images via CNN image purification , 2015, ACM Trans. Graph..

[27]  Sven J. Dickinson,et al.  3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model , 2012, NIPS.

[28]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[29]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Sanja Fidler,et al.  Box in the Box: Joint 3D Layout and Object Reasoning from Single Images , 2013, 2013 IEEE International Conference on Computer Vision.

[32]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[33]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[34]  Jonathan T. Barron,et al.  Multiscale Combinatorial Grouping , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Alexei A. Efros,et al.  Seeing 3D Chairs: Exemplar Part-Based 2D-3D Alignment Using a Large Dataset of CAD Models , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.