Towards Fast and High-Quality Sign Language Production

Sign Language Production (SLP) aims to automatically translate a spoken language description into its corresponding sign language video. The core procedure of SLP is to transform sign gloss intermediaries into sign pose sequences (G2P). Most existing G2P methods are based on sequential autoregression or sequence-to-sequence encoder-decoder learning. However, because they generate each target pose frame conditioned on the previously generated frames, these models are prone to error accumulation and high inference latency. In this paper, we argue that these issues are mainly caused by the autoregressive formulation. Hence, we propose a novel Non-AuToregressive (NAT) model with a parallel decoding scheme, together with an External Aligner for sequence alignment learning. Specifically, we extract alignments from the External Aligner by monotonic alignment search and use them for gloss duration prediction; a length regulator then uses the gloss durations to expand the source gloss sequence to the length of the target sign pose sequence, which enables parallel sign pose generation. Furthermore, we devise a spatial-temporal graph convolutional pose generator in the NAT model to generate smoother and more natural sign pose sequences. Extensive experiments on the PHOENIX14T dataset show that our proposed model outperforms state-of-the-art autoregressive models in both speed and quality.
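The length-regulation step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `length_regulate`, the toy gloss vectors, and the integer durations are all hypothetical, and the durations are assumed to be already available (in the paper they come from the duration predictor trained on the External Aligner's alignments).

```python
from typing import List, Sequence


def length_regulate(
    gloss_states: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat the i-th gloss vector durations[i] times.

    The result has one vector per target pose frame, so a decoder
    could process all frames in parallel instead of one at a time.
    """
    if len(gloss_states) != len(durations):
        raise ValueError("need exactly one duration per gloss")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")

    frames: List[List[float]] = []
    for state, duration in zip(gloss_states, durations):
        # Copy the vector so the expanded frames do not alias each other.
        frames.extend(list(state) for _ in range(duration))
    return frames


# Toy example (hypothetical values): three glosses with 2-d hidden
# states, assumed to last 2, 1 and 3 pose frames respectively.
gloss_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]
frames = length_regulate(gloss_states, durations)
print(len(frames))  # total frames equals sum(durations)
```

The expanded sequence has `sum(durations)` entries, so its length matches the target pose sequence whenever the durations are exact; a gloss with duration 0 simply contributes no frames.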
