Motion Fused Frames: Data Level Fusion Strategy for Hand Gesture Recognition

Acquiring the spatio-temporal states of an action is the most crucial step for action classification. In this paper, we propose a data-level fusion strategy, Motion Fused Frames (MFFs), designed to fuse motion information into static images so that they better represent the spatio-temporal states of an action. MFFs can be used as input to any deep learning architecture with very little modification to the network. We evaluate MFFs on hand gesture recognition tasks using three video datasets (Jester, ChaLearn LAP IsoGD and the NVIDIA Dynamic Hand Gesture dataset), all of which require capturing long-term temporal relations of hand movements. Our approach obtains very competitive performance on the Jester and ChaLearn benchmarks, with classification accuracies of 96.28% and 57.4%, respectively, and achieves state-of-the-art performance on the NVIDIA benchmark with 84.7% accuracy.
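
To make the idea of data-level fusion concrete, the following is a minimal sketch, not the authors' implementation. It assumes that motion information is available as dense optical flow fields with two channels (horizontal and vertical displacement) and that fusion is done by stacking those fields with an RGB frame along the channel axis. The function name `fuse_motion_into_frame`, the channel ordering and the array shapes are illustrative assumptions.

```python
import numpy as np


def fuse_motion_into_frame(rgb, flows):
    """Stack optical flow fields and one RGB frame along the channel axis.

    Assumptions (hypothetical, for illustration only):
      rgb   -- array of shape (H, W, 3), the static image
      flows -- list of arrays of shape (H, W, 2), the flow fields that
               precede the static image
    Returns an array of shape (H, W, 2 * len(flows) + 3).
    """
    for flow in flows:
        if flow.shape[:2] != rgb.shape[:2] or flow.shape[2] != 2:
            raise ValueError("each flow field must have shape (H, W, 2)")
    # Flow channels first, RGB channels last; the order is a free choice,
    # but it must stay the same between training and inference.
    return np.concatenate([*flows, rgb], axis=-1).astype(np.float32)


# Example: three flow fields fused with one 112x112 RGB frame.
rgb_frame = np.zeros((112, 112, 3))
flow_fields = [np.ones((112, 112, 2)) for _ in range(3)]
fused = fuse_motion_into_frame(rgb_frame, flow_fields)
print(fused.shape)  # (112, 112, 9)
```

Under this assumption, the only change a standard image network would need is a first convolutional layer that accepts the larger number of input channels, which is consistent with the claim that little modification to the network is required.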
