Multimodal Transformer for Multimodal Machine Translation

Multimodal Machine Translation (MMT) aims to introduce information from other modalities, typically static images, to improve translation quality. Previous works propose various incorporation methods, but most of them do not consider the relative importance of the modalities: treating all modalities equally may encode too much useless information from the less important ones. In this paper, we introduce multimodal self-attention in the Transformer to address these issues in MMT. The proposed method learns the representation of images conditioned on the text, which avoids encoding irrelevant information from the images. Experiments and visualization analysis demonstrate that our model benefits from visual information and substantially outperforms previous works and competitive baselines on various metrics.
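To make the core idea concrete, the following is a minimal numpy sketch of one plausible form of text-conditioned multimodal self-attention, assuming (as the abstract suggests) that image representations are computed by attending over the text: queries come from both modalities, while keys and values come from the text only, so the image side can only pick up text-relevant content. All function and variable names here are illustrative, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multimodal_self_attention(text, image):
    """Sketch of text-conditioned attention (single head, no projections).

    text:  (n_text, d)  hidden states of the source sentence
    image: (n_img, d)   regional image features

    Queries are the concatenation of text and image states, but keys and
    values are the text states only, so every output row is a mixture of
    text representations -- image-only noise cannot be encoded directly.
    """
    d = text.shape[-1]
    queries = np.concatenate([text, image], axis=0)   # (n_text + n_img, d)
    scores = queries @ text.T / np.sqrt(d)            # attend over text only
    weights = softmax(scores, axis=-1)                # rows sum to 1
    return weights @ text                             # (n_text + n_img, d)

# Toy usage: 5 source tokens, 3 image regions, hidden size 8.
rng = np.random.default_rng(0)
out = multimodal_self_attention(rng.normal(size=(5, 8)),
                                rng.normal(size=(3, 8)))
```

The output has one row per query (source tokens followed by image regions), each grounded in the text representations; a full model would add learned query/key/value projections, multiple heads, and residual connections around this step.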