A deep representation for depth images from synthetic data

Convolutional Neural Networks (CNNs) trained on large-scale RGB databases have become the secret sauce in the majority of recent approaches for object categorization from RGB-D data. Thanks to colorization techniques, which encode a single-channel depth map as a three-channel image, these methods exploit the filters learned from 2D images to extract meaningful representations in 2.5D. Still, the perceptual signatures of these two kinds of images are very different: the former are usually strongly characterized by textures, the latter mostly by object silhouettes. Ideally, one would like to have two CNNs, one for RGB and one for depth, each trained on a suitable data collection and able to capture the perceptual properties of each channel for the task at hand. This has not been possible so far, due to the lack of a suitable depth database. This paper addresses this issue by proposing to use synthetically generated images rather than collecting a large-scale 2.5D database by hand. While clearly only a proxy for real data, synthetic images make it possible to trade quality for quantity and to generate a virtually infinite amount of data. We show that a CNN trained on such a data collection, using the very same architecture typically used on visual data, learns very different filters, resulting in depth features that are (a) better able to characterize the different facets of depth images, and (b) complementary to those derived from CNNs pre-trained on 2D datasets. Experiments on two publicly available databases demonstrate the effectiveness of our approach.
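Colorization, mentioned above, is what lets an RGB-pretrained CNN consume depth data: each depth value is mapped to a color. The sketch below assumes one common scheme, normalizing the valid depths of an image to [0, 1] and passing them through a piecewise-linear approximation of the "jet" colormap. It is illustrative only and not necessarily the encoding used by the methods discussed here. The helper names `jet` and `colorize_depth`, and the convention that a depth of 0 marks a missing sensor reading, are hypothetical assumptions.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def _clip01(x: float) -> float:
    """Clamp a value to the [0, 1] interval."""
    return max(0.0, min(1.0, x))


def jet(v: float) -> RGB:
    """Approximate 'jet' colormap (assumed scheme): v in [0, 1] -> 8-bit RGB.

    Near depths (v = 0) become dark blue, middle depths greenish,
    far depths (v = 1) dark red.
    """
    r = _clip01(1.5 - abs(4.0 * v - 3.0))
    g = _clip01(1.5 - abs(4.0 * v - 2.0))
    b = _clip01(1.5 - abs(4.0 * v - 1.0))
    return (round(255 * r), round(255 * g), round(255 * b))


def colorize_depth(depth: List[List[float]]) -> List[List[RGB]]:
    """Turn a single-channel depth map into a three-channel image.

    Assumption: depth <= 0 means "no reading"; such pixels are painted black.
    Valid depths are min-max normalized per image before color mapping.
    """
    valid = [d for row in depth for d in row if d > 0]
    if not valid:
        # Nothing measurable: return an all-black image of the same shape.
        return [[(0, 0, 0) for _ in row] for row in depth]
    lo, hi = min(valid), max(valid)
    span = hi - lo if hi > lo else 1.0  # avoid division by zero on flat scenes
    return [
        [jet((d - lo) / span) if d > 0 else (0, 0, 0) for d in row]
        for row in depth
    ]


if __name__ == "__main__":
    # 2x2 depth map in meters; the 0.0 entry is a missing reading.
    print(colorize_depth([[0.0, 1.0], [2.0, 3.0]]))
```

The resulting three-channel image can then be fed to a CNN pre-trained on RGB data. Per-image normalization keeps the full color range in use, at the cost of discarding absolute scale; a fixed depth range is an equally plausible alternative.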