TMMF: Temporal Multi-Modal Fusion for Single-Stage Continuous Gesture Recognition

Gesture recognition is a well-studied research area with many real-world applications, including robotics and human-machine interaction. Current gesture recognition methods have focused on recognising isolated gestures, and existing continuous gesture recognition methods are limited to two-stage approaches in which independent models are required for detection and classification, so that classification performance is constrained by detection performance. In contrast, we introduce a single-stage continuous gesture recognition framework, called Temporal Multi-Modal Fusion (TMMF), that can detect and classify multiple gestures in a video with a single model. This approach learns the natural transitions between gestures and non-gestures without a pre-processing segmentation step to detect individual gestures. To achieve this, we introduce a multi-modal fusion mechanism that supports the integration of important information flowing from multi-modal inputs and scales to any number of modes. Additionally, we propose Unimodal Feature Mapping (UFM) and Multi-modal Feature Mapping (MFM) models to map the uni-modal features and the fused multi-modal features, respectively. To further enhance performance, we propose a mid-point based loss function that encourages smooth alignment between the ground truth and the prediction, helping the model to learn natural gesture transitions. We demonstrate the utility of the proposed framework, which can handle variable-length input videos and outperforms the state of the art on three challenging datasets: EgoGesture, IPN Hand and the ChaLearn LAP Continuous Gesture Dataset (ConGD). Furthermore, ablation experiments show the importance of the different components of the proposed framework.
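
As a rough illustration of what a single-stage formulation implies in practice, the sketch below shows how detection and classification can both be read off one per-frame label sequence, with no separate segmentation step. It is not the TMMF implementation: it assumes the model emits one class label per frame and that label 0 marks non-gesture frames, and the names `frames_to_segments` and `NON_GESTURE` are hypothetical.

```python
from itertools import groupby

# Assumed convention (not from the paper): label 0 means "no gesture".
NON_GESTURE = 0


def frames_to_segments(frame_labels, background=NON_GESTURE):
    """Group consecutive identical per-frame labels into gesture segments.

    Returns a list of (label, start_frame, end_frame) tuples with an
    inclusive end index. Runs of the background label are skipped.
    """
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)
        if label != background:
            segments.append((label, start, start + length - 1))
        start += length
    return segments


# Hypothetical per-frame predictions for a 9-frame clip.
predictions = [0, 0, 3, 3, 3, 0, 7, 7, 0]
print(frames_to_segments(predictions))  # [(3, 2, 4), (7, 6, 7)]
```

Because the segment boundaries fall out of the same label sequence that carries the class decisions, the quality of the transitions between gesture and non-gesture frames directly determines detection quality, which is why a loss that encourages smooth alignment at those transitions is relevant.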
