Detailed 3D Model Driven Single View Scene Understanding

We present a data driven approach to holistic scene understanding. From a single image of an indoor scene, our approach estimates its detailed 3D geometry, i.e. The location of its walls and floor, and the 3D appearance of its containing objects, as well as its semantic meaning, i.e. A prediction of what objects it contains. This is made possible by using large datasets of detailed 3D models alongside appearance based detectors. We first estimate the 3D layout of a room, and extrapolate 2D object detection hypotheses to three dimensions to form bounding cuboids. Cuboids are converted to detailed 3D models of the predicted semantic category. Combinations of 3D models are used to create a large list of layout hypotheses for each image -- where each layout hypothesis is semantically meaningful and geometrically plausible. The likelihood of each layout hypothesis is ranked using a learned linear model -- and the hypothesis with the highest predicted likelihood is the final predicted 3D layout. Our approach is able to recover the detailed geometry of scenes, provide precise segmentation of objects in the image plane, and estimate objects' pose in 3D.

[1]  Joseph Schlecht,et al.  Sampling bedrooms , 2011, CVPR 2011.

[2]  Stephen Gould,et al.  Discriminative learning with latent variables for cluttered indoor scene understanding , 2010, CACM.

[3]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Bernt Schiele,et al.  Detailed 3D Representations for Object Recognition and Modeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Martial Hebert,et al.  Data-Driven Scene Understanding from 3D Models , 2012, BMVC.

[6]  ZissermanAndrew,et al.  The Pascal Visual Object Classes Challenge , 2015 .

[7]  Alexei A. Efros,et al.  Seeing 3D Chairs: Exemplar Part-Based 2D-3D Alignment Using a Large Dataset of CAD Models , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Daniel Fried,et al.  Bayesian geometric modeling of indoor scenes , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Takeo Kanade,et al.  Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces , 2010, NIPS.

[10]  Kobus Barnard,et al.  Understanding Bayesian Rooms Using Composite 3D Object Models , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Martial Hebert,et al.  Data-Driven 3D Primitives for Single Image Understanding , 2013, 2013 IEEE International Conference on Computer Vision.

[12]  Jitendra Malik,et al.  Inferring spatial layout from a single image via depth-ordered grouping , 2008, 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[13]  Luc Van Gool,et al.  The 2005 PASCAL Visual Object Classes Challenge , 2005, MLCW.

[14]  Martial Hebert,et al.  3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding , 2013, 2013 IEEE International Conference on Computer Vision.

[15]  Raquel Urtasun,et al.  Efficient Exact Inference for 3D Indoor Scene Understanding , 2012, ECCV.

[16]  Ce Liu,et al.  Depth Extraction from Video Using Non-parametric Sampling , 2012, ECCV.

[17]  Alexei A. Efros,et al.  From 3D scene geometry to human workspace , 2011, CVPR 2011.

[18]  Silvio Savarese,et al.  Understanding Indoor Scenes Using 3D Geometric Phrases , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Calvin C. Zhao Critical Review : Contour Detection and Hierarchical Image Segmentation , 2015 .

[20]  Jianguo Zhang,et al.  The PASCAL Visual Object Classes Challenge , 2006 .

[21]  Antonio Torralba,et al.  Parsing IKEA Objects: Fine Pose Estimation , 2013, 2013 IEEE International Conference on Computer Vision.

[22]  Abhinav Gupta,et al.  Building Part-Based Object Detectors via 3D Geometry , 2013, 2013 IEEE International Conference on Computer Vision.

[23]  Derek Hoiem,et al.  Recovering the spatial layout of cluttered rooms , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[24]  Stephen Gould,et al.  Discriminative Learning with Latent Variables for Cluttered Indoor Scene Understanding , 2010, ECCV.

[25]  Alexei A. Efros,et al.  People Watching: Human Actions as a Cue for Single View Geometry , 2012, International Journal of Computer Vision.

[26]  Takeo Kanade,et al.  Geometric reasoning for single image structure recovery , 2009, CVPR.

[27]  David A. Forsyth,et al.  Thinking Inside the Box: Using Appearance Models and Context Based on Room Geometry , 2010, ECCV.

[28]  Martial Hebert,et al.  Stacked Hierarchical Labeling , 2010, ECCV.

[29]  David A. Forsyth,et al.  Rendering synthetic objects into legacy photographs , 2011, ACM Trans. Graph..