Neural Sign Language Translation

Sign Language Recognition (SLR) has been an active research field for the last two decades. However, most research to date has considered SLR as a naive gesture recognition problem. SLR seeks to recognize a sequence of continuous signs but neglects the underlying rich grammatical and linguistic structures of sign language that differ from spoken language. In contrast, we introduce the Sign Language Translation (SLT) problem. Here, the objective is to generate spoken language translations from sign language videos, taking into account the different word orders and grammar. We formalize SLT in the framework of Neural Machine Translation (NMT) for both end-to-end and pretrained settings (using expert knowledge). This allows us to jointly learn the spatial representations, the underlying language model, and the mapping between sign and spoken language. To evaluate the performance of Neural SLT, we collected the first publicly available Continuous SLT dataset, RWTH-PHOENIX-Weather 2014T1. It provides spoken language translations and gloss level annotations for German Sign Language videos of weather broadcasts. Our dataset contains over .95M frames with >67K signs from a sign vocabulary of >1K and >99K words from a German vocabulary of >2.8K. We report quantitative and qualitative results for various SLT setups to underpin future research in this newly established field. The upper bound for translation performance is calculated at 19.26 BLEU-4, while our end-to-end frame-level and gloss-level tokenization networks were able to achieve 9.58 and 18.13 respectively.

[1]  Guang Li,et al.  Sign Language Recognition and Translation with Kinect , 2013 .

[2]  Yoshua Bengio,et al.  On the Properties of Neural Machine Translation: Encoder–Decoder Approaches , 2014, SSST@EMNLP.

[3]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[4]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[5]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[6]  Juan Carlos Niebles,et al.  Connectionist Temporal Modeling for Weakly Supervised Action Labeling , 2016, ECCV.

[7]  Simon Haykin,et al.  GradientBased Learning Applied to Document Recognition , 2001 .

[8]  Hermann Ney,et al.  RWTH-PHOENIX-Weather: A Large Vocabulary Sign Language Recognition and Translation Corpus , 2012, LREC.

[9]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[10]  Richard Bowden,et al.  Learning signs from subtitles: A weakly supervised approach to sign language recognition , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Richard Bowden,et al.  Sign Language Recognition , 2011, Visual Analysis of Humans.

[12]  Xilin Chen,et al.  Iterative Reference Driven Metric Learning for Signer Independent Isolated Sign Language Recognition , 2016, ECCV.

[13]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[14]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[15]  Hermann Ney,et al.  Re-Sign: Re-Aligned End-to-End Sequence Modelling with Deep Recurrent CNN-HMMs , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Hermann Ney,et al.  Using viseme recognition to improve a sign language translation system , 2013, IWSLT.

[17]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[18]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[19]  Harold L. Somers,et al.  Building a sign language corpus for use in machine translation , 2010 .

[20]  Yann Dauphin,et al.  Convolutional Sequence to Sequence Learning , 2017, ICML.

[21]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[22]  George Kurian,et al.  Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation , 2016, ArXiv.

[23]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[24]  Marion Kee,et al.  Analysis , 2004, Machine Translation.

[25]  Lale Akarun,et al.  BosphorusSign: A Turkish Sign Language Recognition Corpus in Health and Finance Domains , 2016, LREC.

[26]  Hermann Ney,et al.  Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers , 2015, Comput. Vis. Image Underst..

[27]  Phil Blunsom,et al.  Recurrent Continuous Translation Models , 2013, EMNLP.

[28]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[29]  Hermann Ney,et al.  Deep Sign: Hybrid CNN-HMM for Continuous Sign Language Recognition , 2016, BMVC.

[30]  Andrew Zisserman,et al.  Learning sign language by watching TV (using weakly aligned subtitles) , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Jordan Fenlon,et al.  Building the British Sign Language Corpus , 2013 .

[32]  Hermann Ney,et al.  Analysis, preparation, and optimization of statistical sign language machine translation , 2012, Machine Translation.

[33]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[34]  Thomas Hanke,et al.  DGS Corpus & Dicta-Sign: The Hamburg Studio Setup , 2010 .

[35]  Andrew Zisserman,et al.  Automatic and Efficient Human Pose Estimation for Sign Language Videos , 2013, International Journal of Computer Vision.

[36]  Nicolas Pugeault,et al.  Sign language recognition using sub-units , 2012, J. Mach. Learn. Res..

[37]  Andrew Zisserman,et al.  Large-scale Learning of Sign Language by Watching TV (Using Co-occurrences) , 2013, BMVC.

[38]  Hermann Ney,et al.  Extensions of the Sign Language Recognition and Translation Corpus RWTH-PHOENIX-Weather , 2014, LREC.

[39]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Geoffrey E. Hinton,et al.  Deep Learning , 2015, Nature.

[41]  Naftali Tishby,et al.  Opening the Black Box of Deep Neural Networks via Information , 2017, ArXiv.

[42]  Alexander J. Smola,et al.  Efficient mini-batch training for stochastic optimization , 2014, KDD.

[43]  Chong Wang,et al.  Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[44]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[45]  Hermann Ney,et al.  Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data is Continuous and Weakly Labelled , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Nicolas Pugeault,et al.  Sign Language Recognition using Sequential Pattern Trees , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[49]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[50]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[51]  Joon Son Chung,et al.  Lip Reading Sentences in the Wild , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[53]  Christopher D. Manning,et al.  Effective Approaches to Attention-based Neural Machine Translation , 2015, EMNLP.

[54]  W. Stokoe Sign Language Structure , 1980 .

[55]  Graham Neubig,et al.  Neural Machine Translation and Sequence-to-sequence Models: A Tutorial , 2017, ArXiv.

[56]  Changshui Zhang,et al.  Recurrent Convolutional Neural Networks for Continuous Sign Language Recognition by Staged Optimization , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[58]  Hermann Ney,et al.  Deep Learning of Mouth Shapes for Sign Language , 2015, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW).

[59]  Naftali Tishby,et al.  Deep learning and the information bottleneck principle , 2015, 2015 IEEE Information Theory Workshop (ITW).

[60]  Stan Sclaroff,et al.  Sign Language Spotting with a Threshold Model Based on Conditional Random Fields , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.