Gesture and Sign Language Recognition with Temporal Residual Networks

Gesture and sign language recognition in a continuous video stream is a challenging task, especially with a large vocabulary. In this work, we approach this as a framewise classification problem. We tackle it using temporal convolutions and recent advances in the deep learning field like residual networks, batch normalization and exponential linear units (ELUs). The models are evaluated on three different datasets: the Dutch Sign Language Corpus (Corpus NGT), the Flemish Sign Language Corpus (Corpus VGT) and the ChaLearn LAP RGB-D Continuous Gesture Dataset (ConGD). We achieve a 73.5% top-10 accuracy for 100 signs with the Corpus NGT, 56.4% with the Corpus VGT and a mean Jaccard index of 0.316 with the ChaLearn LAP ConGD without the usage of depth maps.

[1]  Sergio Escalera,et al.  ChaLearn Joint Contest on Multimedia Challenges Beyond Visual Analysis: An overview , 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[2]  Sergio Escalera,et al.  Results and Analysis of ChaLearn LAP Multi-modal Isolated and Continuous Gesture Recognition, and Real Versus Fake Expressed Emotions Challenges , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[3]  Jürgen Schmidhuber,et al.  Learning Precise Timing with LSTM Recurrent Networks , 2003, J. Mach. Learn. Res..

[4]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Jovan Popović,et al.  Real-time hand-tracking with a color glove , 2009, SIGGRAPH 2009.

[7]  Guang Li,et al.  Sign Language Recognition and Translation with Kinect , 2013 .

[8]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[9]  Mieke Van Herreweghe,et al.  Sign classification in sign language Corpora with deep neural networks , 2016, LREC 2016.

[10]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[11]  Andrew Zisserman,et al.  Deep Convolutional Neural Networks for Efficient Pose Estimation in Gesture Videos , 2014, ACCV.

[12]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[13]  Surya Ganguli,et al.  Exact solutions to the nonlinear dynamics of learning in deep linear neural networks , 2013, ICLR.

[14]  Sergio Escalera,et al.  ChaLearn Looking at People RGB-D Isolated and Continuous Datasets for Gesture Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[15]  Ming C. Leu,et al.  American Sign Language word recognition with a sensory glove using artificial neural networks , 2011, Eng. Appl. Artif. Intell..

[16]  Petros Maragos,et al.  Multimodal gesture recognition via multiple hypotheses rescoring , 2015, J. Mach. Learn. Res..

[17]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[18]  Christian Wolf,et al.  ModDrop: Adaptive Multi-Modal Gesture Recognition , 2014, IEEE Trans. Pattern Anal. Mach. Intell..

[19]  Sander Dieleman,et al.  Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video , 2015, International Journal of Computer Vision.

[20]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[21]  Richard Bowden,et al.  A boosted classifier tree for hand shape detection , 2004, Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004. Proceedings..

[22]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[23]  Simon Haykin,et al.  GradientBased Learning Applied to Document Recognition , 2001 .

[24]  Andrew Zisserman,et al.  Automatic and Efficient Human Pose Estimation for Sign Language Videos , 2013, International Journal of Computer Vision.

[25]  Fei Yang,et al.  Non-manual grammatical marker recognition based on multi-scale, spatio-temporal analysis of head pose and facial expressions , 2014, Image Vis. Comput..

[26]  Changshui Zhang,et al.  Recurrent Convolutional Neural Networks for Continuous Sign Language Recognition by Staged Optimization , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Ling Shao,et al.  Deep Dynamic Neural Networks for Multimodal Gesture Segmentation and Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Myriam Vermeerbergen,et al.  Het Corpus VGT. Een digitaal open access corpus van videos and annotaties van Vlaamse Gebarentaal, ontwikkeld aan de Universiteit Gent ism KU Leuven. , 2015 .

[29]  Sepp Hochreiter,et al.  Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs) , 2015, ICLR.

[30]  Onno Crasborn,et al.  The Corpus NGT: An online corpus for professionals and laymen , 2008 .