Constrained Parametric Proposals and Pooling Methods for Semantic Segmentation in RGB-D Images

We address semantic segmentation from RGB-D data, with emphasis on cluttered indoor scenes that contain many instances from many visual categories. Our approach is based on a parametric figure-ground proposal process, constrained by both intensity and depth, that generates spatial layout hypotheses at multiple locations and scales in the image. It is followed by a sequential inference algorithm that integrates the proposals into a complete scene estimate. Our contributions are: (1) a generalization of the parametric max-flow figure-ground proposal methodology that takes advantage of intensity and depth information in order to systematically and efficiently generate the breakpoints of an underlying spatial model in polynomial time; (2) new region description methods based on second-order pooling over multiple features constructed from both intensity and depth channels; (3) an inference procedure that resolves conflicts between overlapping spatial partitions and handles scenes with a large number of object category instances at very different scales; (4) an extensive evaluation of the impact of depth, as well as of the effectiveness of a large number of descriptors, both pre-designed and automatically obtained using deep learning, on a difficult RGB-D semantic segmentation problem with 92 classes. We report state-of-the-art results on the challenging NYU Depth v2 dataset, extended for the RMRC 2013 Indoor Segmentation Challenge, where the proposed model currently ranks first, with an average score of 24.61% and 39 classes won. Moreover, we show that combining second-order and deep learning features yields an additional relative accuracy improvement of over 15%. On a scene classification benchmark, our methodology further improves the state of the art by 24%.
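
To make the idea of second-order pooling in contribution (2) concrete, the following is a minimal sketch of one common formulation: the local descriptors inside a region are averaged as outer products, and the resulting symmetric matrix is mapped through a matrix logarithm before being flattened. This is an illustration under stated assumptions, not the pipeline evaluated here. The function name `second_order_avg_pool`, the regularizer `eps`, and the descriptor dimensions are all hypothetical, and the abstract does not specify which pooling variant or normalization is used.

```python
import numpy as np

def second_order_avg_pool(features, eps=1e-6):
    """Pool local descriptors of one region into a single vector.

    features: (n, d) array, one d-dimensional local descriptor per row
              (hypothetical layout; e.g. intensity and depth features
              concatenated per pixel or per keypoint).
    Returns a vector of length d * (d + 1) / 2.
    """
    X = np.asarray(features, dtype=float)
    n, d = X.shape
    # Average of outer products x x^T over the region: symmetric, PSD.
    G = X.T @ X / n
    # Small ridge (assumed value) so every eigenvalue is strictly positive.
    G += eps * np.eye(d)
    # Matrix logarithm of a symmetric positive definite matrix
    # via its eigendecomposition.
    w, V = np.linalg.eigh(G)
    log_G = (V * np.log(w)) @ V.T
    # The matrix is symmetric, so the upper triangle carries all entries.
    return log_G[np.triu_indices(d)]

# Usage with made-up data: 500 local descriptors of dimension 8.
rng = np.random.default_rng(0)
region_descriptor = second_order_avg_pool(rng.normal(size=(500, 8)))
print(region_descriptor.shape)
```

For d-dimensional local descriptors the pooled vector has d(d+1)/2 entries, so the region descriptor grows quadratically with the local feature dimension.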
