Gumbel-Attention for Multi-modal Machine Translation

Multi-modal machine translation (MMT) improves translation quality by introducing visual information. However, existing MMT models ignore the problem that an image may carry information irrelevant to the text, introducing noise into the model and degrading translation quality. In this paper, we propose a novel Gumbel-Attention mechanism for multi-modal machine translation that selects the text-related parts of the image features. Specifically, unlike previous attention-based methods, we first use a differentiable method to select image information and automatically discard the useless parts of the image features. The image-aware text representation is generated from the Gumbel-Attention score matrix and the image features. We then independently encode the text representation and the image-aware text representation with the multi-modal encoder. Finally, the final output of the encoder is obtained through multi-modal gated fusion. Experiments and case analysis show that our method retains the image features related to the text, and that the retained parts help the MMT model generate better translations.
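The core idea above can be sketched numerically. The fragment below is a minimal illustration, not the paper's implementation: it assumes dot-product text-to-region scores relaxed with Gumbel-Softmax (so region selection stays differentiable yet near one-hot), followed by a simplified gated fusion; the real model presumably uses learned projections inside a Transformer encoder, and the function names here are invented for the sketch.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable relaxation of sampling from a categorical distribution.

    Adds Gumbel(0, 1) noise to the logits and applies a temperature-scaled
    softmax; low tau pushes the weights toward a one-hot selection.
    """
    rng = np.random.default_rng() if rng is None else rng
    gumbel = -np.log(-np.log(rng.uniform(1e-9, 1.0, logits.shape)))
    y = (logits + gumbel) / tau
    y -= y.max(axis=-1, keepdims=True)          # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

def gumbel_attention(text, image, tau=1.0, rng=None):
    """Select text-relevant image regions per token.

    text:  (n_tokens, d) token representations
    image: (n_regions, d) image-region features
    Returns an image-aware text representation of shape (n_tokens, d).
    """
    scores = text @ image.T / np.sqrt(text.shape[-1])   # (n_tokens, n_regions)
    weights = gumbel_softmax(scores, tau, rng)          # near one-hot per token
    return weights @ image

def gated_fusion(h_text, h_img):
    """Simplified multi-modal gated fusion (assumption: the paper likely
    computes the gate from learned linear projections; here a per-position
    sigmoid of the dot product stands in)."""
    lam = 1.0 / (1.0 + np.exp(-(h_text * h_img).sum(-1, keepdims=True)))
    return h_text + lam * h_img
```

Lowering `tau` makes each token attend to essentially a single image region, which is how irrelevant regions get discarded; raising it recovers soft attention.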
