Learning Rich Features from RGB-D Images for Object Detection and Segmentation

In this paper we study the problem of object detection for RGB-D images using semantically rich image and depth features. We propose a new geocentric embedding for depth images that encodes height above ground and angle with gravity for each pixel in addition to the horizontal disparity. We demonstrate that this geocentric embedding works better than using raw depth images for learning feature representations with convolutional neural networks. Our final object detection system achieves an average precision of 37.3%, which is a 56% relative improvement over existing methods. We then focus on the task of instance segmentation where we label pixels belonging to object instances found by our detector. For this task, we propose a decision forest approach that classifies pixels in the detection window as foreground or background using a family of unary and binary tests that query shape and geocentric pose features. Finally, we use the output from our object detectors in an existing superpixel classification framework for semantic scene segmentation and achieve a 24% relative improvement over current state-of-the-art for the object categories that we study. We believe advances such as those represented in this paper will facilitate the use of perception in fields like robotics.

[1]  Lawrence D. Jackel,et al.  Backpropagation Applied to Handwritten Zip Code Recognition , 1989, Neural Computation.

[2]  Yali Amit,et al.  Joint Induction of Shape Features and Tree Classifiers , 1997, IEEE Trans. Pattern Anal. Mach. Intell..

[3]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[4]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[5]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Dieter Fox,et al.  A large-scale hierarchical multi-view RGB-D object dataset , 2011, 2011 IEEE International Conference on Robotics and Automation.

[7]  Charless C. Fowlkes,et al.  Contour Detection and Hierarchical Image Segmentation , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Andrew W. Fitzgibbon,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR 2011.

[9]  Jonathan T. Barron,et al.  A category-level 3-D object dataset: Putting the Kinect to work , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[10]  Thorsten Joachims,et al.  Semantic Labeling of 3D Point Clouds for Indoor Scenes , 2011, NIPS.

[11]  Dieter Fox,et al.  Unsupervised Feature Learning for RGB-D Based Object Recognition , 2012, ISER.

[12]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[13]  Xiaofeng Ren,et al.  Discriminatively Trained Sparse Code Gradients for Contour Detection , 2012, NIPS.

[14]  Derek Hoiem,et al.  Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[15]  Andrew Y. Ng,et al.  Convolutional-Recursive Deep Learning for 3D Object Classification , 2012, NIPS.

[16]  Dieter Fox,et al.  RGB-(D) scene labeling: Features and algorithms , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  James M. Keller,et al.  Histogram of Oriented Normal Vectors for Object Recognition with a Depth Sensor , 2012, ACCV.

[18]  Derek Hoiem,et al.  Support Surface Prediction in Indoor Scenes , 2013, 2013 IEEE International Conference on Computer Vision.

[19]  Camille Couprie,et al.  Learning Hierarchical Features for Scene Labeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Nick Barnes,et al.  Learning Structured Hough Voting for Joint Object Detection and Occlusion Reasoning , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Silvio Savarese,et al.  Accurate Localization of 3D Objects from RGB-D Data Using Segmentation Hypotheses , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Jitendra Malik,et al.  Object Detection in RGB-D Indoor Scenes 1 , 2013 .

[23]  Cristian Sminchisescu,et al.  CPMC-3D-O2P: Semantic segmentation of RGB-D images using CPMC and Second Order Pooling , 2013, ArXiv.

[24]  Joseph J. Lim,et al.  Sketch Tokens: A Learned Mid-level Representation for Contour and Object Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Yann LeCun,et al.  Indoor Semantic Segmentation using depth information , 2013, ICLR.

[26]  C. Lawrence Zitnick,et al.  Structured Forests for Fast Edge Detection , 2013, 2013 IEEE International Conference on Computer Vision.

[27]  Sanja Fidler,et al.  Holistic Scene Understanding for 3D Object Detection with RGBD Cameras , 2013, 2013 IEEE International Conference on Computer Vision.

[28]  Jitendra Malik,et al.  Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Abhinav Gupta,et al.  Building Part-Based Object Detectors via 3D Geometry , 2013, 2013 IEEE International Conference on Computer Vision.

[30]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Jitendra Malik,et al.  Simultaneous Detection and Segmentation , 2014, ECCV.

[32]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[33]  Jonathan T. Barron,et al.  Multiscale Combinatorial Grouping , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Svetlana Lazebnik,et al.  Scene Parsing with Object Instances and Occlusion Ordering , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[36]  C. Lawrence Zitnick,et al.  Fast Edge Detection Using Structured Forests , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.