RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features

Object recognition and pose estimation from RGB-D images are important tasks for manipulation robots, and both can be learned from examples. Creating and annotating datasets for learning is expensive, however. We address this problem with transfer learning from deep convolutional neural networks (CNNs) that are pre-trained for image categorization and provide a rich, semantically meaningful feature set. We incorporate depth information, which the CNN was not trained with, by rendering objects from a canonical perspective and colorizing the depth channel according to distance from the object center. We evaluate our approach on the Washington RGB-D Objects dataset, where we find that the generated feature set naturally separates classes and instances well and retains pose manifolds. We outperform the state of the art on a number of subtasks and show that our approach can yield superior results when only a small amount of training data is available.
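
The depth colorization step can be illustrated with a minimal sketch. It is not the implementation used in this work: the function name `colorize_by_center_distance`, the use of the point centroid as the object center, and the green-red-blue color ramp are all assumptions chosen for illustration. The sketch maps each 3D point of a segmented object to an RGB color according to its distance from the center, which produces an image-like encoding of depth that a CNN pre-trained on color images could consume.

```python
import numpy as np

def colorize_by_center_distance(points):
    """Return one RGB color (uint8) per 3D point, based on distance from the center.

    Assumptions (not taken from the paper): the object center is the centroid
    of the points, and the color ramp runs green -> red -> blue.
    """
    points = np.asarray(points, dtype=np.float64)
    center = points.mean(axis=0)                      # assumed object center
    dist = np.linalg.norm(points - center, axis=1)    # Euclidean distance per point

    # Normalize distances to [0, 1]; guard against a degenerate single-point object.
    max_dist = dist.max()
    t = dist / max_dist if max_dist > 0 else np.zeros_like(dist)

    # Hypothetical three-stop ramp, interpolated per channel.
    stops = [0.0, 0.5, 1.0]
    r = np.interp(t, stops, [0.0, 1.0, 0.0])
    g = np.interp(t, stops, [1.0, 0.0, 0.0])
    b = np.interp(t, stops, [0.0, 0.0, 1.0])

    rgb = np.stack([r, g, b], axis=1)
    return (rgb * 255).round().astype(np.uint8)

# Usage: two points at the center and two at equal distance from it.
example_points = [[0, 0, 0], [2, 0, 0], [-2, 0, 0], [0, 0, 0]]
example_colors = colorize_by_center_distance(example_points)
```

Normalizing by the largest distance makes the coloring independent of object size, which is one possible design choice; a fixed metric scale would instead preserve absolute size information.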
