3D Room Layout Estimation From a Single RGB Image

3D layout is crucial for scene understanding and reconstruction, and is useful in applications such as real estate and furniture design. In this paper, we propose a fully automatic solution for estimating the 3D layout of an indoor scene from a single 2D image. Our technique has two key components. First, we train a neural network that directly estimates room structure lines from the input image. Second, we propose a novel technique that automatically identifies the layout topology of the input image, followed by a nonlinear optimization with equality constraints that estimates the final 3D layout of the scene. To the best of our knowledge, this is the first fully automatic technique for single-image 3D layout estimation of an indoor scene. We evaluate our method on the public datasets LSUN, Hedau, and 3DGP; the results show that the proposed method achieves accurate 3D layout reconstruction on images with a variety of layout topologies.
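
The abstract does not spell out the optimization problem, so the following is only a toy sketch of what a nonlinear optimization with an equality constraint can look like, not the paper's formulation. It assumes a hypothetical setting in which two adjacent wall directions are observed as noisy floor-plan angles and must be made exactly perpendicular. The function name `refine_wall_angles`, the perpendicularity constraint, and the augmented-Lagrangian solver with its parameters are all illustrative assumptions.

```python
import math


def refine_wall_angles(a_obs, b_obs, rho=10.0, step=0.04, outer=20, inner=200):
    """Toy augmented-Lagrangian solver (hypothetical, not the paper's method).

    Minimizes (a - a_obs)^2 + (b - b_obs)^2
    subject to h(a, b) = cos(a - b) = 0, i.e. the two walls are perpendicular.
    """
    a, b, lam = a_obs, b_obs, 0.0
    for _ in range(outer):
        # Inner loop: gradient descent on the augmented Lagrangian
        # f + lam * h + (rho / 2) * h^2 with the multiplier held fixed.
        for _ in range(inner):
            h = math.cos(a - b)
            dh_da = -math.sin(a - b)  # dh/db is the negative of this
            w = lam + rho * h
            grad_a = 2.0 * (a - a_obs) + w * dh_da
            grad_b = 2.0 * (b - b_obs) - w * dh_da
            a -= step * grad_a
            b -= step * grad_b
        # Outer loop: multiplier update driven by the remaining violation.
        lam += rho * math.cos(a - b)
    return a, b


# Two noisy wall angles (radians) that are roughly, but not exactly, 90 degrees apart.
a, b = refine_wall_angles(0.1, 1.5)
print(f"a = {a:.4f}, b = {b:.4f}, cos(a - b) = {math.cos(a - b):.2e}")
```

The equality constraint is enforced exactly at convergence rather than as a soft penalty, which is the point of using a constrained solver here. A real layout problem would have many more unknowns and constraints, and would more likely rely on an off-the-shelf solver.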
