SimulSLT: End-to-End Simultaneous Sign Language Translation

Sign language translation, a technology with profound social significance, has attracted growing research interest in recent years. However, existing sign language translation methods must read the entire video before starting to translate, which leads to high inference latency and limits their application in real-life scenarios. To solve this problem, we propose SimulSLT, the first end-to-end simultaneous sign language translation model, which can translate sign language videos into target text concurrently. SimulSLT is composed of a text decoder, a boundary predictor, and a masked encoder. We 1) use the wait-k strategy for simultaneous translation; 2) design a novel boundary predictor based on the integrate-and-fire module to output the gloss boundary, which is used to model the correspondence between the sign language video and the gloss; and 3) propose an innovative re-encode method that allows the existing video features to interact fully, helping the model obtain richer contextual information. Experimental results on the RWTH-PHOENIX-Weather 2014T dataset show that SimulSLT achieves BLEU scores that exceed those of the latest end-to-end non-simultaneous sign language translation model while maintaining low latency, which demonstrates the effectiveness of our method.
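The two mechanisms named in the abstract, the wait-k strategy and integrate-and-fire boundary detection, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the fixed threshold of 1.0, and the hand-picked per-frame weights are all assumptions made for illustration. In a real model the weights would come from a learned predictor, and the source units counted by wait-k would be the glosses delimited by the predicted boundaries.

```python
from typing import List


def wait_k_schedule(k: int, num_source: int, num_target: int) -> List[int]:
    """Hypothetical helper: for each target token t (1-indexed), return how many
    source units must have been read before that token is emitted.

    Follows the usual wait-k rule: read k units first, then alternate between
    writing one token and reading one unit until the source is exhausted.
    """
    return [min(k + t - 1, num_source) for t in range(1, num_target + 1)]


def integrate_and_fire(weights: List[float], threshold: float = 1.0) -> List[int]:
    """Hypothetical helper: accumulate per-frame weights and "fire" a boundary
    whenever the running sum reaches the threshold.

    Assumes every weight lies in [0, threshold], so at most one boundary can
    fire per frame. Returns the indices of the frames at which a boundary fires.
    """
    boundaries = []
    accumulated = 0.0
    for frame_index, weight in enumerate(weights):
        accumulated += weight
        if accumulated >= threshold:
            boundaries.append(frame_index)
            # Carry the remainder over to the next segment.
            accumulated -= threshold
    return boundaries


# Illustrative inputs only (not taken from the paper).
schedule = wait_k_schedule(k=3, num_source=5, num_target=6)
frame_weights = [0.5, 0.5, 0.25, 0.25, 0.5, 0.75, 0.25]
boundaries = integrate_and_fire(frame_weights)
```

With these illustrative inputs, `schedule` is `[3, 4, 5, 5, 5, 5]`: the first token is written after three source units, and the schedule saturates once all five units have been read. `boundaries` is `[1, 4, 6]`, meaning the accumulated weight reaches the threshold at frames 1, 4, and 6.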
