Egocentric Gesture Recognition Using Recurrent 3D Convolutional Neural Networks with Spatiotemporal Transformer Modules

Gesture is a natural interface in interacting with wearable devices such as VR/AR helmet and glasses. The main challenge of gesture recognition in egocentric vision arises from the global camera motion caused by the spontaneous head movement of the device wearer. In this paper, we address the problem by a novel recurrent 3D convolutional neural network for end-to-end learning. We specially design a spatiotemporal transformer module with recurrent connections between neighboring time slices which can actively transform a 3D feature map into a canonical view in both spatial and temporal dimensions. To validate our method, we introduce a new dataset with sufficient size, variation and reality, which contains 83 gestures designed for interaction with wearable devices, and more than 24,000 RGB-D gesture samples from 50 subjects captured in 6 scenes. On this dataset, we show that the proposed network outperforms competing state-of-the-art algorithms. Moreover, our method can achieve state-of-the-art performance on the challenging GTEA egocentric action dataset.

[1]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[2]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Luca Benini,et al.  Gesture Recognition in Ego-centric Videos Using Dense Trajectories and Hand Segmentation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[4]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[5]  James M. Rehg,et al.  Learning to recognize objects in egocentric activities , 2011, CVPR 2011.

[6]  Oscar Koller,et al.  Using Convolutional 3D Neural Networks for User-independent continuous gesture recognition , 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[7]  Stefan Lee,et al.  Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[8]  Ling Shao,et al.  Learning Discriminative Representations from RGB-D Video Data , 2013, IJCAI.

[9]  Yichao Huang,et al.  A Pointing Gesture Based Egocentric Interaction System: Dataset, Approach and Application , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[10]  Alex Pentland,et al.  Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[11]  Wenjun Zeng,et al.  Online Human Action Detection using Joint Classification-Regression Recurrent Neural Networks , 2016, ECCV.

[12]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[13]  Limin Wang,et al.  Action recognition with trajectory-pooled deep-convolutional descriptors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Kris M. Kitani,et al.  Going Deeper into First-Person Activity Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[16]  Marc'Aurelio Ranzato,et al.  Transformation-Based Models of Video Sequences , 2017, ArXiv.

[17]  C. V. Jawahar,et al.  First Person Action Recognition Using Deep Learned Descriptors , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Deva Ramanan,et al.  Understanding Everyday Hands in Action from RGB-D Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[19]  Alex Graves,et al.  Generating Sequences With Recurrent Neural Networks , 2013, ArXiv.

[20]  C. V. Jawahar,et al.  Trajectory aligned features for first person action recognition , 2016, Pattern Recognit..

[21]  Mohan M. Trivedi,et al.  Hand Gesture Recognition in Real Time for Automotive Interfaces: A Multimodal Vision-Based Approach and Evaluations , 2014, IEEE Transactions on Intelligent Transportation Systems.

[22]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[23]  Razvan Pascanu,et al.  Theano: new features and speed improvements , 2012, ArXiv.

[24]  Pavlo Molchanov,et al.  Hand gesture recognition with 3D convolutional neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[25]  Pichao Wang,et al.  Large-scale Continuous Gesture Recognition Using Convolutional Neural Networks , 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[26]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[27]  Sergio Escalera,et al.  ChaLearn Looking at People RGB-D Isolated and Continuous Datasets for Gesture Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[28]  Ole Winther,et al.  Recurrent Spatial Transformer Networks , 2015, ArXiv.

[29]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[30]  Bo Huang,et al.  Toward End-to-End Face Recognition Through Alignment Learning , 2017, IEEE Signal Processing Letters.

[31]  Pavlo Molchanov,et al.  Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Yang Gao,et al.  Multi-layered gesture recognition with Kinect , 2015, J. Mach. Learn. Res..

[33]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.