A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation

Multi-modal neural machine translation (NMT) aims to translate source sentences paired with images into a target language. However, dominant multi-modal NMT models do not fully exploit fine-grained semantic correspondences between semantic units of different modalities, which have the potential to refine multi-modal representation learning. To address this issue, in this paper we propose a novel graph-based multi-modal fusion encoder for NMT. Specifically, we first represent the input sentence and image as a unified multi-modal graph, which captures various semantic relationships between multi-modal semantic units (words and visual objects). We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations. Finally, these representations provide an attention-based context vector for the decoder. We evaluate our proposed encoder on the Multi30K dataset. Experimental results and in-depth analysis show the superiority of our multi-modal NMT model.
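The stacked fusion layers described above can be sketched as iterative message passing over the unified multi-modal graph. The snippet below is a minimal illustrative sketch, not the paper's implementation: node vectors are plain Python lists, the hypothetical `fuse_step` function replaces the paper's learned attention with a simple neighbourhood mean, and the mixing weight `alpha` is an assumed stand-in for gated fusion.

```python
# Hypothetical sketch of one graph-based multi-modal fusion step.
# Real models use learned, attention-weighted aggregation; here a plain
# neighbourhood mean stands in so the structure of the update is visible.

def fuse_step(node_vecs, edges, alpha=0.5):
    """One message-passing update over a unified multi-modal graph.

    node_vecs: list of feature vectors (word nodes first, then visual objects).
    edges: undirected (i, j) index pairs covering intra- and inter-modal links.
    alpha: assumed mixing weight between a node's own state and its
           neighbourhood mean (stand-in for a learned gate).
    """
    n = len(node_vecs)
    neighbours = [[] for _ in range(n)]
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)

    updated = []
    for i, vec in enumerate(node_vecs):
        if not neighbours[i]:
            updated.append(vec[:])  # isolated node keeps its representation
            continue
        dim = len(vec)
        # Mean of neighbour representations (stand-in for attention pooling).
        mean = [sum(node_vecs[j][k] for j in neighbours[i]) / len(neighbours[i])
                for k in range(dim)]
        updated.append([alpha * vec[k] + (1 - alpha) * mean[k]
                        for k in range(dim)])
    return updated

# Toy graph: two word nodes and one visual-object node, connected within
# and across modalities; repeating the step mimics stacked fusion layers.
words = [[1.0, 0.0], [0.0, 1.0]]
objs = [[0.5, 0.5]]
edges = [(0, 1), (0, 2), (1, 2)]
states = words + objs
for _ in range(2):  # two stacked fusion layers
    states = fuse_step(states, edges)
```

After the stacked updates, each node's representation mixes information from both modalities; in the actual encoder, these final node states would feed the attention-based context vector for the decoder.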
