Shallow2Deep: Indoor Scene Modeling by Single Image Understanding

Abstract: Dense indoor scene modeling from 2D images is bottlenecked by the absence of depth information and by cluttered occlusions. We present an automatic indoor scene modeling approach that uses deep features from neural networks. Given a single RGB image, our method simultaneously recovers semantic content, 3D geometry, and object relationships by reasoning about indoor environment context. In particular, we design a shallow-to-deep architecture based on convolutional networks for semantic scene understanding and modeling. It uses multi-level convolutional networks to parse indoor semantics and geometry into non-relational and relational knowledge. Non-relational knowledge extracted by the shallow-end networks (e.g., room layout, object geometry) is fed forward into deeper levels to parse relational semantics (e.g., support relationships). A Relation Network is proposed to infer the support relationships between objects. All of these structured semantics and geometry are then assembled to guide a global optimization for 3D scene modeling. Qualitative and quantitative analyses demonstrate the feasibility of our method for understanding and modeling semantics-enriched indoor scenes, evaluated in terms of reconstruction accuracy, computational performance, and scene complexity.
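
The shallow-to-deep flow described in the abstract, in which per-object (non-relational) features are passed forward to a later stage that reasons over pairs of objects, can be illustrated with a small sketch of pairwise relational scoring. This is not the authors' Relation Network: the function names, the feature dimension, the two-layer scoring function `g`, and the random weights standing in for trained parameters are all assumptions made for illustration. The sketch only shows the general pattern of scoring every ordered object pair and normalizing the scores into a distribution over candidate supporters.

```python
import math
import random


def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Random weights for a two-layer MLP (a stand-in for trained parameters)."""
    w1 = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(out_dim)]
    return w1, w2


def mlp_forward(params, x):
    """Linear layer, ReLU, linear layer (no biases, for brevity)."""
    w1, w2 = params
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def support_probabilities(object_features, g_params):
    """For each object i, return a distribution over candidate supporters j != i."""
    n = len(object_features)
    result = []
    for i in range(n):
        candidates = [j for j in range(n) if j != i]
        scores = []
        for j in candidates:
            # g scores the ordered pair (i, j): "is object i supported by object j?"
            pair = object_features[i] + object_features[j]  # concatenate features
            scores.append(mlp_forward(g_params, pair)[0])
        result.append(dict(zip(candidates, softmax(scores))))
    return result


rng = random.Random(0)
feature_dim = 4  # hypothetical size of each per-object feature vector
# Hypothetical non-relational features for three detected objects.
objects = [[rng.uniform(0, 1) for _ in range(feature_dim)] for _ in range(3)]
g = make_mlp(2 * feature_dim, 8, 1, rng)

probs = support_probabilities(objects, g)
for i, dist in enumerate(probs):
    best = max(dist, key=dist.get)
    print(f"object {i}: most likely supporter is object {best}")
```

With untrained weights the predicted supporters are arbitrary. In a trained system the per-object inputs would come from the earlier stages (for example, object geometry and layout cues), and the set of candidate supporters would presumably also include room surfaces such as the floor or a wall.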
