Contrastive Disentangled Meta-Learning for Signer-Independent Sign Language Translation

Sign language translation aims to translate a sign language video directly into a natural-language sentence. Most existing methods use video-sentence pairs from multiple specific signers as both training and testing samples. However, such a setting does not match real-world applications: a practical sign language translation system is expected to provide accurate translations for unseen signers. In this paper, we address the signer-independent setting and focus on improving the generalization ability of the translation model. To adapt to this challenging setting, we propose a novel framework called contrastive disentangled meta-learning (CDM), which introduces improvements in both the deep architecture and the training mode. Specifically, based on the minimax entropy objective, a disentangled module with adaptive gated units is developed to decouple the signer-specific and task-specific representations in the encoder. In addition, we facilitate frame-word alignment by imposing contrastive constraints between the obtained task-specific representation and the decoding output. The disentangled and contrastive modules provide complementary information to each other. As for the training mode, we encourage the model to perform well in simulated signer-independent scenarios by finding generalized learning directions in the meta-learning process. Since vanilla meta-learning methods make insufficient use of the multiple specific signers, we adopt a fine-grained learning strategy that simultaneously conducts meta-learning in a variety of domain shift scenarios in each iteration. Extensive experiments on the benchmark dataset RWTH-PHOENIX-Weather-2014T (PHOENIX14T) show that CDM achieves competitive results compared with state-of-the-art methods.
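
The fine-grained strategy described above runs meta-learning over several simulated domain shift scenarios in each iteration. The abstract does not say how those scenarios are built, so the following is only a minimal sketch under one assumption: each scenario holds out some of the seen signers as a simulated "unseen" meta-test set and uses the rest as meta-train. The function name `domain_shift_splits`, the parameter `n_meta_test`, and the signer identifiers are hypothetical and do not come from the paper.

```python
from itertools import combinations


def domain_shift_splits(signers, n_meta_test=1):
    """Enumerate (meta_train, meta_test) partitions of the seen signers.

    Assumption (not from the paper): a domain shift scenario is simulated by
    holding out `n_meta_test` signers and training on the remaining ones.
    """
    signers = sorted(signers)
    splits = []
    for held_out in combinations(signers, n_meta_test):
        # Every signer that is not held out belongs to the meta-train set.
        meta_train = [s for s in signers if s not in held_out]
        splits.append((meta_train, list(held_out)))
    return splits


# Hypothetical signer identifiers, used purely for illustration.
splits = domain_shift_splits(["signer_a", "signer_b", "signer_c"])
for meta_train, meta_test in splits:
    print(meta_train, "->", meta_test)
```

Under this assumption, a single training iteration would visit all of the enumerated splits, not just one randomly sampled split, which is one plausible reading of "a variety of domain shift scenarios in each iteration".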
