Skeleton Aware Multi-modal Sign Language Recognition

Sign language is commonly used by deaf or speech-impaired people to communicate, but it requires significant effort to master. Sign Language Recognition (SLR) aims to bridge the gap between sign language users and others by recognizing signs from given videos. It is an essential yet challenging task, since sign language is performed with fast and complex movements of hand gestures, body posture, and even facial expressions. Recently, skeleton-based action recognition has attracted increasing attention because it is independent of subject and background variation. However, skeleton-based SLR remains underexplored due to the lack of annotations on hand keypoints. Some efforts have been made to use hand detectors with pose estimators to extract hand keypoints and learn to recognize sign language via neural networks, but none of them outperforms RGB-based methods. To this end, we propose a novel Skeleton Aware Multi-modal SLR framework (SAM-SLR) that takes advantage of multi-modal information to achieve a higher recognition rate. Specifically, we propose a Sign Language Graph Convolution Network (SL-GCN) to model the embedded dynamics and a novel Separable Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. RGB and depth modalities are also incorporated and assembled into our framework to provide global information that is complementary to the skeleton-based methods SL-GCN and SSTCN. As a result, SAM-SLR achieves the highest performance in both the RGB (98.42%) and RGB-D (98.53%) tracks of the 2021 Looking at People Large Scale Signer Independent Isolated SLR Challenge. Our code is available at https://github.com/jackyjsy/CVPR21Chal-SLR
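
The abstract says that several modalities are assembled into one framework but does not state how their outputs are combined. As a rough illustration only, the sketch below assumes a simple weighted late fusion: each modality produces a per-class score vector, and the final prediction is the argmax of the weighted sum. The modality names, scores, and weights are hypothetical placeholders and are not taken from the paper or its released code.

```python
from typing import Dict, List


def fuse_scores(scores: Dict[str, List[float]],
                weights: Dict[str, float]) -> List[float]:
    """Return the weighted sum of per-class scores across modalities."""
    num_classes = len(next(iter(scores.values())))
    fused = [0.0] * num_classes
    for name, class_scores in scores.items():
        if len(class_scores) != num_classes:
            raise ValueError(f"modality {name!r} has a different class count")
        weight = weights[name]
        for i, score in enumerate(class_scores):
            fused[i] += weight * score
    return fused


def predict(scores: Dict[str, List[float]],
            weights: Dict[str, float]) -> int:
    """Return the index of the class with the highest fused score."""
    fused = fuse_scores(scores, weights)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical per-class scores for a toy 3-class problem (not real results).
scores = {
    "skeleton_gcn": [0.2, 0.7, 0.1],
    "skeleton_cnn": [0.3, 0.5, 0.2],
    "rgb": [0.5, 0.4, 0.1],
    "depth": [0.4, 0.4, 0.2],
}
# Hypothetical fusion weights; in practice these would be tuned on held-out data.
weights = {"skeleton_gcn": 1.0, "skeleton_cnn": 0.5, "rgb": 0.8, "depth": 0.3}

print(predict(scores, weights))
```

In this toy example the RGB stream alone would pick class 0, while the fused scores favor class 1, which illustrates how complementary modalities can change the final decision.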
