Creating 3D Bounding Box Hypotheses From Deep Network Score-Maps

There are two common paradigms for indoor scene understanding: pixel-level labeling and bounding box generation. The two tasks are complementary in nature but are normally performed separately with different computational pipelines. We propose a novel method that bridges the two tasks by creating category-specific 3D bounding box hypotheses from the score maps of any deep network trained on pixel-level semantic labels, combined with depth data. These hypotheses can then be used to locate all objects as distinct, non-overlapping bounding boxes by incorporating high-level knowledge such as common room settings, co-existence, and co-exclusiveness. We develop an objective function that combines confidence scores with depth visibility to initialize and optimize multiple hypotheses for each category-specific score map. Experimental results show that our method significantly outperforms direct bounding box generation from pixel-level labels.
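To make the idea concrete, the sketch below scores a single axis-aligned 3D box hypothesis against one category-specific score map and a depth image. It is an illustration only: the function and parameter names (`box_objective`, `backproject`, `lam`) and the simple visibility proxy are assumptions, not the paper's actual formulation, which combines confidence scores with depth visibility as described above.

```python
# Illustrative sketch only; the paper's exact objective is not reproduced here.
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Back-project every pixel of a depth map to 3D camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)  # (H, W, 3)

def box_objective(box_min, box_max, score_map, depth, fx, fy, cx, cy, lam=0.5):
    """Score one axis-aligned 3D box hypothesis for one category.

    Term 1 rewards the summed network confidence of pixels whose back-projected
    3D points fall inside the box; term 2 penalizes pixels inside the box's
    image footprint whose observed depth lies beyond the box (visible free
    space where the hypothesis claims solid extent) -- a crude stand-in for
    the depth-visibility term.
    """
    points = backproject(depth, fx, fy, cx, cy)
    inside = np.all((points >= box_min) & (points <= box_max), axis=-1)
    confidence = score_map[inside].sum()

    # 2D footprint of the box: project its eight corners and take their bounds.
    corners = np.array([[x, y, z] for x in (box_min[0], box_max[0])
                                  for y in (box_min[1], box_max[1])
                                  for z in (box_min[2], box_max[2])])
    us = fx * corners[:, 0] / corners[:, 2] + cx
    vs = fy * corners[:, 1] / corners[:, 2] + cy
    u, v = np.meshgrid(np.arange(depth.shape[1]), np.arange(depth.shape[0]))
    footprint = ((u >= us.min()) & (u <= us.max()) &
                 (v >= vs.min()) & (v <= vs.max()))
    see_through = footprint & (depth > box_max[2])  # camera sees past the box
    visibility_penalty = see_through.sum()

    return confidence - lam * visibility_penalty
```

In this sketch, a hypothesis could be initialized from a high-confidence region of the score map and refined by any derivative-free search over the six box parameters that maximizes `box_objective`; the actual initialization and optimization scheme is the one described in the paper.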
