RGB-D Hand-Held Object Recognition Based on Heterogeneous Feature Fusion

Object recognition has many applications in human-machine interaction and multimedia retrieval. However, because of large intra-class variability and inter-class similarity, accurate recognition from RGB data alone remains challenging. With the emergence of inexpensive RGB-D devices, this challenge can be better addressed by leveraging the additional depth information. A special yet important case is hand-held object recognition, since manipulating objects with the hands is common and intuitive in both human-human and human-machine interaction. In this paper, we study this problem and introduce an effective framework to address it. The framework first detects and segments the hand-held object by combining skeleton information with depth information. In the recognition stage, it extracts heterogeneous features from the different modalities and fuses them to improve recognition accuracy. In particular, we combine handcrafted and deep-learned features and study several multi-step fusion variants. Experimental evaluations validate the effectiveness of the proposed method.
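The two stages described above can be illustrated with a minimal sketch. The abstract does not give the actual segmentation rule or fusion scheme, so everything here is an assumption for illustration only: the function names, the window and depth-tolerance parameters, and the choice of simple L2-normalized concatenation as the fusion step are hypothetical and are not the paper's method. The segmentation step keeps pixels that lie near a skeleton hand joint both in the image plane and in depth; the fusion step concatenates feature vectors from different sources after normalizing each one.

```python
import numpy as np

def segment_near_hand(depth_mm, hand_rc, hand_depth_mm, window=80, depth_tol_mm=150.0):
    """Boolean mask of pixels close to a hand joint in image space and in depth.

    depth_mm:      2-D depth map in millimetres (0 marks a missing reading).
    hand_rc:       (row, col) of the hand joint projected into the depth image.
    hand_depth_mm: depth of the hand joint in millimetres.
    window, depth_tol_mm: hypothetical tuning parameters, not from the paper.
    """
    rows, cols = depth_mm.shape
    r, c = hand_rc
    # Clip the search window to the image bounds.
    r0, r1 = max(r - window, 0), min(r + window + 1, rows)
    c0, c1 = max(c - window, 0), min(c + window + 1, cols)
    patch = depth_mm[r0:r1, c0:c1].astype(np.float64)
    mask = np.zeros(depth_mm.shape, dtype=bool)
    # Keep valid pixels whose depth is within the tolerance of the hand depth.
    mask[r0:r1, c0:c1] = (patch > 0) & (np.abs(patch - hand_depth_mm) <= depth_tol_mm)
    return mask

def l2_normalize(x, eps=1e-12):
    """Scale a vector to unit L2 norm so that no single feature type dominates."""
    x = np.asarray(x, dtype=np.float64)
    return x / (np.linalg.norm(x) + eps)

def fuse_features(feature_vectors):
    """Fuse heterogeneous features by concatenating their normalized vectors."""
    return np.concatenate([l2_normalize(f) for f in feature_vectors])
```

In such a setup, `fuse_features` would receive, for example, one handcrafted descriptor and one deep-learned descriptor computed on the masked region, and the fused vector would be passed to a classifier. Concatenation is only the simplest fusion baseline; the multi-step variants studied in the paper are not reproduced here.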
