Indoor Scene Understanding with RGB-D Images: Bottom-up Segmentation, Object Detection and Semantic Segmentation

In this paper, we address the problems of contour detection, bottom-up grouping, object detection and semantic segmentation on RGB-D data. We focus on the challenging setting of cluttered indoor scenes, and evaluate our approach on the recently introduced NYU-Depth V2 (NYUD2) dataset (Silberman et al., ECCV, 2012). We propose algorithms for object boundary detection and hierarchical segmentation that generalize the $$gPb-ucm$$gPb-ucm approach of Arbelaez et al. (TPAMI, 2011) by making effective use of depth information. We show that our system can label each contour with its type (depth, normal or albedo). We also propose a generic method for long-range amodal completion of surfaces and show its effectiveness in grouping. We train RGB-D object detectors by analyzing and computing histogram of oriented gradients on the depth image and using them with deformable part models (Felzenszwalb et al., TPAMI, 2010). We observe that this simple strategy for training object detectors significantly outperforms more complicated models in the literature. We then turn to the problem of semantic segmentation for which we propose an approach that classifies superpixels into the dominant object categories in the NYUD2 dataset. We design generic and class-specific features to encode the appearance and geometry of objects. We also show that additional features computed from RGB-D object detectors and scene classifiers further improves semantic segmentation accuracy. In all of these tasks, we report significant improvements over the state-of-the-art.

[1]  A. Savitzky,et al.  Smoothing and Differentiation of Data by Simplified Least Squares Procedures. , 1964 .

[2]  G. Kanizsa,et al.  Organization in Vision: Essays on Gestalt Perception , 1979 .

[3]  Andrew E. Johnson,et al.  Using Spin Images for Efficient Object Recognition in Cluttered 3D Scenes , 1999, IEEE Trans. Pattern Anal. Mach. Intell..

[4]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[5]  Jitendra Malik,et al.  Recognizing Objects in Range Data Using Regional Point Descriptors , 2004, ECCV.

[6]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[7]  Jitendra Malik,et al.  Learning to detect natural image boundaries using local brightness, color, and texture cues , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[9]  Jianguo Zhang,et al.  The PASCAL Visual Object Classes Challenge , 2006 .

[10]  Alexei A. Efros,et al.  Recovering Surface Layout from an Image , 2007, International Journal of Computer Vision.

[11]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[12]  Ashutosh Saxena,et al.  3-D Depth Reconstruction from a Single Still Image , 2007, International Journal of Computer Vision.

[13]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[14]  T. Kanade,et al.  Geometric reasoning for single image structure recovery , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[16]  Nico Blodow,et al.  Fast Point Feature Histograms (FPFH) for 3D registration , 2009, 2009 IEEE International Conference on Robotics and Automation.

[17]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Pushmeet Kohli,et al.  Graph Cut Based Inference with Co-occurrence Statistics , 2010, ECCV.

[19]  Koen E. A. van de Sande,et al.  Evaluating Color Descriptors for Object and Scene Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Takeo Kanade,et al.  Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces , 2010, NIPS.

[21]  Subhransu Maji,et al.  Detecting People Using Mutually Consistent Poselet Activations , 2010, ECCV.

[22]  Alexei A. Efros,et al.  Recovering Occlusion Boundaries from an Image , 2011, International Journal of Computer Vision.

[23]  Alexei A. Efros,et al.  Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics , 2010, ECCV.

[24]  Dieter Fox,et al.  A large-scale hierarchical multi-view RGB-D object dataset , 2011, 2011 IEEE International Conference on Robotics and Automation.

[25]  Charless C. Fowlkes,et al.  Contour Detection and Hierarchical Image Segmentation , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Alexei A. Efros,et al.  From 3D scene geometry to human workspace , 2011, CVPR 2011.

[27]  Cristian Sminchisescu,et al.  Object Recognition by Sequential Figure-Ground Ranking , 2012, International Journal of Computer Vision.

[28]  Andrew W. Fitzgibbon,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR 2011.

[29]  Jonathan T. Barron,et al.  A category-level 3-D object dataset: Putting the Kinect to work , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[30]  Toby Sharp,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR.

[31]  Thorsten Joachims,et al.  Semantic Labeling of 3D Point Clouds for Indoor Scenes , 2011, NIPS.

[32]  Andrew W. Fitzgibbon,et al.  KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera , 2011, UIST.

[33]  Jitendra Malik,et al.  Semantic segmentation using regions and parts , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Cristian Sminchisescu,et al.  Semantic Segmentation with Second-Order Pooling , 2012, ECCV.

[35]  Andrea Fossati,et al.  Consumer Depth Cameras for Computer Vision: Research Topics and Applications , 2012 .

[36]  Antonio Criminisi,et al.  Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning , 2012, Found. Trends Comput. Graph. Vis..

[37]  David A. Forsyth,et al.  Recovering free space of indoor scenes from a single image , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  Xiaofeng Ren,et al.  Discriminatively Trained Sparse Code Gradients for Contour Detection , 2012, NIPS.

[39]  Derek Hoiem,et al.  Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[40]  Dieter Fox,et al.  RGB-(D) scene labeling: Features and algorithms , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  James M. Keller,et al.  Histogram of Oriented Normal Vectors for Object Recognition with a Depth Sensor , 2012, ACCV.

[42]  Dieter Fox,et al.  RGB-D Object Recognition: Features, Algorithms, and a Large Scale Benchmark , 2013, Consumer Depth Cameras for Computer Vision.

[43]  Silvio Savarese,et al.  Accurate Localization of 3D Objects from RGB-D Data Using Segmentation Hypotheses , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[44]  Jitendra Malik,et al.  Object Detection in RGB-D Indoor Scenes 1 , 2013 .

[45]  Jitendra Malik,et al.  Intrinsic Scene Properties from a Single RGB-D Image , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Derek Hoiem,et al.  Learning Collections of Part Models for Object Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  C. Lawrence Zitnick,et al.  Structured Forests for Fast Edge Detection , 2013, 2013 IEEE International Conference on Computer Vision.

[48]  Andrea Fossati,et al.  Consumer Depth Cameras for Computer Vision , 2013, Advances in Computer Vision and Pattern Recognition.

[49]  Subhransu Maji,et al.  Efficient Classification for Additive Kernel SVMs , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[50]  Jitendra Malik,et al.  Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[51]  Jitendra Malik,et al.  Learning Rich Features from RGB-D Images for Object Detection and Segmentation , 2014, ECCV.

[52]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.