Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks

Sign languages are multi-channel visual languages, where signers use a continuous 3D space to communicate. Sign language production (SLP), the automatic translation from spoken to sign languages, must embody both the continuous articulation and full morphology of sign to be truly understandable by the Deaf community. Previous deep learning-based SLP works have produced only a concatenation of isolated signs, focusing primarily on manual features and leading to robotic, non-expressive productions. In this work, we propose a novel Progressive Transformer architecture, the first SLP model to translate from spoken language sentences to continuous 3D multi-channel sign pose sequences in an end-to-end manner. Our transformer network introduces a counter decoding technique that enables variable-length continuous sequence generation by tracking the production progress over time and predicting the end of sequence. We present extensive data augmentation techniques to reduce prediction drift, alongside an adversarial training regime and a mixture density network (MDN) formulation, to produce realistic and expressive sign pose sequences. We propose a back-translation evaluation mechanism for SLP, presenting benchmark quantitative results on the challenging PHOENIX14T dataset and setting baselines for future research. We further provide a user evaluation of our SLP model to understand the Deaf reception of our sign pose productions.
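The abstract describes the counter decoding mechanism and the MDN formulation only at a high level. The sketch below is a minimal, hypothetical PyTorch illustration of how such an output head could be structured, assuming a diagonal Gaussian mixture over a 3D pose keypoint vector plus a normalised counter value; the class and parameter names (MDNCounterHead, hidden_dim, pose_dim, num_mixtures) are our own assumptions and do not reflect the authors' implementation.

```python
import torch
import torch.nn as nn


class MDNCounterHead(nn.Module):
    """Hypothetical MDN output head with a progress counter.

    A sketch of one way to realise the abstract's ideas: the decoder
    hidden state is mapped to Gaussian-mixture parameters over the 3D
    sign pose vector, and to a counter in [0, 1] that tracks production
    progress so the end of sequence can be predicted.
    """

    def __init__(self, hidden_dim: int, pose_dim: int, num_mixtures: int = 4):
        super().__init__()
        self.num_mixtures = num_mixtures
        self.pose_dim = pose_dim
        # Mixture weights, means and (diagonal) standard deviations.
        self.pi = nn.Linear(hidden_dim, num_mixtures)
        self.mu = nn.Linear(hidden_dim, num_mixtures * pose_dim)
        self.log_sigma = nn.Linear(hidden_dim, num_mixtures * pose_dim)
        # Counter value tracking how far through the sequence we are.
        self.counter = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor):
        # h: (batch, time, hidden_dim) decoder outputs.
        pi = torch.softmax(self.pi(h), dim=-1)                      # (B, T, K)
        mu = self.mu(h).view(*h.shape[:-1], self.num_mixtures, self.pose_dim)
        sigma = torch.exp(self.log_sigma(h)).view_as(mu)            # positive std-devs
        counter = torch.sigmoid(self.counter(h))                    # (B, T, 1) in [0, 1]
        return pi, mu, sigma, counter


def mdn_nll(pi, mu, sigma, target):
    """Negative log-likelihood of the target pose under the Gaussian mixture."""
    # target: (B, T, D) -> (B, T, 1, D) to broadcast against K mixture components.
    target = target.unsqueeze(-2)
    comp = torch.distributions.Normal(mu, sigma)
    log_prob = comp.log_prob(target).sum(-1)                        # (B, T, K)
    log_mix = torch.logsumexp(torch.log(pi + 1e-8) + log_prob, dim=-1)
    return -log_mix.mean()
```

In such a setup, the pose branch would be trained with the mixture negative log-likelihood above (plus any adversarial loss), while the counter branch would be regressed against a linearly increasing target; at inference, generation could stop once the predicted counter crosses a threshold, giving variable-length output without an explicit end-of-sequence token.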
