Large-Margin Multi-Modal Deep Learning for RGB-D Object Recognition

Most existing feature learning-based methods for RGB-D object recognition either combine RGB and depth data in an undifferentiated manner from the outset, or learn features from color and depth separately, which do not adequately exploit different characteristics of the two modalities or utilize the shared relationship between the modalities. In this paper, we propose a general CNN-based multi-modal learning framework for RGB-D object recognition. We first construct deep CNN layers for color and depth separately, which are then connected with a carefully designed multi-modal layer. This layer is designed to not only discover the most discriminative features for each modality, but is also able to harness the complementary relationship between the two modalities. The results of the multi-modal layer are back-propagated to update parameters of the CNN layers, and the multi-modal feature learning and the back-propagation are iteratively performed until convergence. Experimental results on two widely used RGB-D object datasets show that our method for general multi-modal learning achieves comparable performance to state-of-the-art methods specifically designed for RGB-D data.

[1]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Honggang Zhang,et al.  Web Multimedia Object Classification Using Cross-Domain Correlation Knowledge , 2013, IEEE Transactions on Multimedia.

[3]  Martin A. Riedmiller,et al.  A learned feature descriptor for object recognition in RGB-D data , 2012, 2012 IEEE International Conference on Robotics and Automation.

[4]  H. Hotelling Relations Between Two Sets of Variates , 1936 .

[5]  M. Elad,et al.  $rm K$-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation , 2006, IEEE Transactions on Signal Processing.

[6]  Meng Wang,et al.  Neighborhood Discriminant Hashing for Large-Scale Image Retrieval , 2015, IEEE Transactions on Image Processing.

[7]  Aly A. Farag,et al.  CSIFT: A SIFT Descriptor with Color Invariant Characteristics , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[8]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[9]  Dieter Fox,et al.  Unsupervised Feature Learning for RGB-D Based Object Recognition , 2012, ISER.

[10]  Heinrich H. Bülthoff,et al.  Going into depth: Evaluating 2D and 3D cues for object classification on a new, large-scale object dataset , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[11]  Yann LeCun,et al.  Convolutional neural networks applied to house numbers digit classification , 2012, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).

[12]  Yihong Gong,et al.  Locality-constrained Linear Coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[13]  Camille Couprie,et al.  Learning Hierarchical Features for Scene Labeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Jiwen Lu,et al.  Large Margin Multi-metric Learning for Face and Kinship Verification in the Wild , 2014, ACCV.

[15]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[16]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[17]  Andrew E. Johnson,et al.  Using Spin Images for Efficient Object Recognition in Cluttered 3D Scenes , 1999, IEEE Trans. Pattern Anal. Mach. Intell..

[18]  Christian Szegedy,et al.  DeepPose: Human Pose Estimation via Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Xiang Zhang,et al.  Text Understanding from Scratch , 2015, ArXiv.

[20]  Dieter Fox,et al.  Depth kernel descriptors for object recognition , 2011, 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[21]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[22]  Quoc V. Le,et al.  ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning , 2011, NIPS.

[23]  Xiaoou Tang,et al.  Learning a Deep Convolutional Network for Image Super-Resolution , 2014, ECCV.

[24]  Ming Yang,et al.  DeepFace: Closing the Gap to Human-Level Performance in Face Verification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Geoffrey E. Hinton,et al.  Deep Boltzmann Machines , 2009, AISTATS.

[26]  John D. Lafferty,et al.  Learning image representations from the pixel level via hierarchical sparse coding , 2011, CVPR 2011.

[27]  Yann LeCun,et al.  Indoor Semantic Segmentation using depth information , 2013, ICLR.

[28]  Nenghai Yu,et al.  Scale-Invariant Visual Language Modeling for Object Categorization , 2009, IEEE Trans. Multim..

[29]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[30]  Shuicheng Yan,et al.  Body Surface Context: A New Robust Feature for Action Recognition From Depth Videos , 2014, IEEE Transactions on Circuits and Systems for Video Technology.

[31]  Dieter Fox,et al.  A large-scale hierarchical multi-view RGB-D object dataset , 2011, 2011 IEEE International Conference on Robotics and Automation.

[32]  A. Bruckstein,et al.  K-SVD : An Algorithm for Designing of Overcomplete Dictionaries for Sparse Representation , 2005 .

[33]  Xiaogang Wang,et al.  Learning Semantic Signatures for 3D Object Retrieval , 2013, IEEE Transactions on Multimedia.

[34]  Dieter Fox,et al.  Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms , 2011, NIPS.

[35]  Yi Li,et al.  Convolutional Neural Networks for No-Reference Image Quality Assessment , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Nathan Silberman,et al.  Indoor scene segmentation using a structured light sensor , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[37]  Andrew Y. Ng,et al.  Convolutional-Recursive Deep Learning for 3D Object Classification , 2012, NIPS.

[38]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Jitendra Malik,et al.  Learning Rich Features from RGB-D Images for Object Detection and Segmentation , 2014, ECCV.

[40]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[41]  Meng Wang,et al.  Deep Learning of Scene-Specific Classifier for Pedestrian Detection , 2014, ECCV.

[42]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[43]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Chao-Hung Lin,et al.  Point Cloud Encoding for 3D Building Model Retrieval , 2014, IEEE Transactions on Multimedia.

[45]  Andrew Zisserman,et al.  Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps , 2013, ICLR.

[46]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[47]  D. T. Lee,et al.  Unsupervised Feature Learning for RGB-D Image Classification , 2014, ACCV.

[48]  Mubarak Shah,et al.  Multimodal Analysis for Identification and Segmentation of Moving-Sounding Objects , 2013, IEEE Transactions on Multimedia.

[49]  Honglak Lee,et al.  Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations , 2009, ICML '09.

[50]  Thorsten Joachims,et al.  Semantic Labeling of 3D Point Clouds for Indoor Scenes , 2011, NIPS.

[51]  Jitendra Malik,et al.  Representing and Recognizing the Visual Appearance of Materials using Three-dimensional Textons , 2001, International Journal of Computer Vision.

[52]  Luc Van Gool,et al.  SURF: Speeded Up Robust Features , 2006, ECCV.

[53]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.