One-shot learning for RGB-D hand-held object recognition

With the advance of computer technology and smart devices, many applications, such as face recognition and object recognition, have been developed to make human-computer interaction (HCI) more efficient. In this respect, hand-held object recognition plays an important role in HCI: it can be used not only to help computers understand the user's intentions but also to meet the user's requirements. In recent years, the appearance of convolutional neural networks (CNNs) has greatly enhanced the performance of object recognition, and this technology has been applied to hand-held object recognition in several works. However, these supervised learning models require large amounts of labelled data and many training iterations to fit their large number of parameters. This is a serious challenge for HCI, because HCI must respond in real time and it is difficult to collect enough labelled data; in particular, learning a new category requires considerable time to update the model. In this work, we adopt a one-shot learning method to address this problem: the model does not need to be retrained when a new category is to be learnt. Moreover, depth images are robust to variations in lighting and color, so we fuse depth information with RGB to harness the complementary relationship between the two modalities and improve hand-held object recognition. Experimental results on our hand-held object dataset demonstrate that our method improves recognition performance.
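The key property claimed above, that a metric-based one-shot learner can absorb a new category without retraining, can be illustrated with a minimal sketch. The code below is a hypothetical simplification, not the paper's actual model: it assumes RGB and depth embeddings are already extracted (e.g., by pretrained CNNs), fuses them by simple concatenation, and classifies a query by nearest support example in the fused embedding space. Adding a new category then amounts to appending one labelled support vector; no parameters are updated.

```python
import numpy as np

def fuse_features(rgb_feat, depth_feat):
    """Late-fuse RGB and depth embeddings by concatenation.

    Concatenation is one simple fusion option assumed here for
    illustration; the paper's exact fusion scheme may differ.
    """
    return np.concatenate([rgb_feat, depth_feat], axis=-1)

def one_shot_classify(query, support_feats, support_labels):
    """Label a query by its nearest support example (one per class).

    Because classification is pure distance matching in embedding
    space, enrolling a new category needs no gradient updates.
    """
    dists = np.linalg.norm(support_feats - query, axis=1)
    return support_labels[int(np.argmin(dists))]

# Toy support set: one fused RGB-D embedding per known class.
support = np.stack([
    fuse_features(np.ones(4), np.zeros(4)),   # class "cup"
    fuse_features(np.zeros(4), np.ones(4)),   # class "phone"
])
labels = ["cup", "phone"]

query = fuse_features(np.full(4, 0.9), np.zeros(4))
print(one_shot_classify(query, support, labels))  # -> "cup"

# Enrol a brand-new category with a single example: just append it.
support = np.vstack([support, fuse_features(np.full(4, 5.0), np.full(4, 5.0))])
labels.append("bottle")
print(one_shot_classify(fuse_features(np.full(4, 4.8), np.full(4, 5.1)),
                        support, labels))  # -> "bottle"
```

A prototypical- or matching-network variant would replace the raw nearest-neighbour rule with distances to class prototypes or attention-weighted matching, but the no-retraining property at enrolment time is the same.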
