Adversarial Training for Multi-Channel Sign Language Production

Sign Languages are rich multi-channel languages, requiring articulation of both manual (hands) and non-manual (face and body) features in a precise, intricate manner. Sign Language Production (SLP), the automatic translation from spoken to sign languages, must embody this full sign morphology to be truly understandable by the Deaf community. Previous work has mainly focused on manual feature production, with an under-articulated output caused by regression to the mean. In this paper, we propose an Adversarial Multi-Channel approach to SLP. We frame sign production as a minimax game between a transformer-based Generator and a conditional Discriminator. Our adversarial discriminator evaluates the realism of sign production conditioned on the source text, pushing the generator towards a realistic and articulate output. Additionally, we fully encapsulate sign articulators with the inclusion of non-manual features, producing facial features and mouthing patterns. We evaluate on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset, and report state-of-the-art SLP back-translation performance for manual production. We set new benchmarks for the production of multi-channel sign to underpin future research into realistic SLP.
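The minimax game described above can be sketched as a pair of complementary losses: the conditional discriminator learns to score real text-conditioned pose sequences as 1 and generated ones as 0, while the generator is trained to make its output score as real. This is a minimal, illustrative sketch in plain Python (not the paper's implementation); `d_real` and `d_fake` stand in for the discriminator's probability outputs on real and generated sequences, and the non-saturating generator loss is one common choice of objective.

```python
import math

def bce(pred: float, target: float) -> float:
    """Binary cross-entropy for a single probability in (0, 1)."""
    eps = 1e-7  # clamp to avoid log(0)
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

def discriminator_loss(d_real: float, d_fake: float) -> float:
    # Conditional discriminator: real (text, pose-sequence) pairs should
    # score 1, (text, generated-sequence) pairs should score 0.
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake: float) -> float:
    # Non-saturating generator objective: maximise the discriminator's
    # score on generated sequences, pushing output away from the mean pose.
    return bce(d_fake, 1.0)
```

In practice this adversarial term would be added to the generator's regression loss on the pose sequence, so the discriminator penalises the under-articulated "mean" output that regression alone tends to produce.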
