Feature learning based on SAE-PCA network for human gesture recognition in RGBD images

Abstract With the emergence of depth sensors such as Microsoft Kinect, human hand gesture recognition has attracted growing research interest. A successful gesture recognition system usually relies heavily on a good feature representation of the data, which is expected to be task-dependent and to cope with the challenges and opportunities introduced by depth sensors. In this paper, a feature learning approach based on sparse auto-encoder (SAE) and principal component analysis (PCA) is proposed for recognizing human actions, i.e. finger-spelling or sign language, from RGB-D inputs. The proposed feature learning model consists of two components. First, features are learned separately from the RGB and depth channels, using sparse auto-encoders with convolutional neural networks. Second, the learned features from both channels are concatenated and fed into a multi-layer PCA to obtain the final feature. Experimental results on an American Sign Language (ASL) dataset demonstrate that the proposed feature learning model is effective: it improves the recognition rate from 75% to 99.05% and outperforms the state of the art.
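The second component (concatenating per-channel features and passing them through stacked PCA layers) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the SAE/CNN feature extractors are replaced by random placeholder arrays, and the function names, feature sizes, and layer dimensions (`fit_pca`, `fuse_and_reduce`, 64, 32, 16) are assumptions chosen only to show the data flow.

```python
import numpy as np

def fit_pca(X, n_components):
    """Fit PCA on the rows of X; return the mean and the top principal directions."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def apply_pca(X, mean, components):
    """Project the rows of X onto the given principal directions."""
    return (X - mean) @ components.T

def fuse_and_reduce(rgb_feats, depth_feats, layer_dims):
    """Concatenate RGB and depth features, then apply one PCA per entry in layer_dims."""
    out = np.concatenate([rgb_feats, depth_feats], axis=1)
    for k in layer_dims:
        mean, comps = fit_pca(out, k)
        out = apply_pca(out, mean, comps)
    return out

# Placeholder features standing in for the learned SAE outputs (assumption).
rng = np.random.default_rng(0)
rgb_feats = rng.normal(size=(200, 64))    # 200 samples, 64-D RGB features
depth_feats = rng.normal(size=(200, 64))  # 200 samples, 64-D depth features

final = fuse_and_reduce(rgb_feats, depth_feats, layer_dims=(32, 16))
print(final.shape)  # (200, 16)
```

Stacking purely linear PCA layers with nothing in between, as done here, yields the same subspace as a single PCA to the final dimension. The abstract does not say what the paper places between its PCA layers, so the sketch should be read only as an illustration of the fuse-then-reduce flow.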
