Paper information - A Visual Attention Grounding Neural Model for Multimodal Machine Translation

We introduce a novel multimodal machine translation model that utilizes parallel visual and textual information. Our model jointly optimizes the learning of a shared visual-language embedding and a translator. The model leverages a visual attention grounding mechanism that links the visual semantics with the corresponding textual semantics. Our approach achieves competitive state-of-the-art results on the Multi30K and the Ambiguous COCO datasets. We also collected a new multilingual multimodal product description dataset to simulate a real-world international online shopping scenario. On this dataset, our visual attention grounding model outperforms other methods by a large margin.
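The joint objective described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names, the dot-product attention scoring, the cosine-based max-margin ranking loss, the `margin` and `alpha` values, and the premise that image features and encoder states have already been projected to the same dimension. The sketch shows one plausible way that image-conditioned attention over encoder states could produce a grounded sentence embedding, and how a grounding loss could be mixed with a translation loss.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def grounded_sentence_embedding(hidden_states, image_vec):
    """Attention-weighted sum of encoder states.

    Assumption: each word's weight comes from the dot product between
    its hidden state and the image vector.
    """
    weights = softmax([dot(h, image_vec) for h in hidden_states])
    dim = len(hidden_states[0])
    emb = [sum(w * h[i] for w, h in zip(weights, hidden_states))
           for i in range(dim)]
    return emb, weights


def ranking_loss(text_embs, image_vecs, margin=0.1):
    """Hypothetical max-margin ranking loss over a batch.

    A matched (text, image) pair should score higher than any
    mismatched pair by at least `margin`.
    """
    loss = 0.0
    for i, t in enumerate(text_embs):
        pos = cosine(t, image_vecs[i])
        for j, v in enumerate(image_vecs):
            if j != i:
                # Contrast text i against a wrong image.
                loss += max(0.0, margin - pos + cosine(t, v))
                # Contrast image i against a wrong text.
                loss += max(0.0, margin - pos
                            + cosine(text_embs[j], image_vecs[i]))
    return loss


def joint_loss(translation_loss, grounding_loss, alpha=0.5):
    # Assumed convex combination of the two training signals.
    return alpha * translation_loss + (1 - alpha) * grounding_loss


# Two encoder states; the image vector is aligned with the first one.
emb, weights = grounded_sentence_embedding([[1.0, 0.0], [0.0, 1.0]],
                                           [2.0, 0.0])
```

In this toy call the first word receives the larger attention weight because its hidden state points in the same direction as the image vector, which is the intuition behind linking visual semantics to the corresponding words.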

Yong Jae Lee | Zhou Yu | Mingyang Zhou | Runxiang Cheng

[1] Jian Sun, et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[2] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.

[4] Kuldip K. Paliwal, et al. Bidirectional recurrent neural networks, 1997, IEEE Trans. Signal Process.

[5] Grzegorz Chrupala, et al. Learning language through pictures, 2015, ACL.

[6] Jean Oh, et al. Attention-based Multimodal Neural Machine Translation, 2016, WMT.

[7] Joost van de Weijer, et al. LIUM-CVC Submissions for WMT18 Multimodal Translation Task, 2018, WMT.

[8] Ruslan Salakhutdinov, et al. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014, ArXiv.

[9] Rico Sennrich, et al. Neural Machine Translation of Rare Words with Subword Units, 2015, ACL.

[10] Michael S. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.

[11] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[12] Jindřich Helcl, et al. CUNI System for the WMT18 Multimodal Translation Task, 2018, WMT.

[13] Nick Campbell, et al. Doubly-Attentive Decoder for Multi-modal Neural Machine Translation, 2017, ACL.

[14] Dapeng Li, et al. OSU Multimodal Machine Translation System Report, 2017, WMT.

[15] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[17] Desmond Elliott, et al. Imagination Improves Multimodal Translation, 2017, IJCNLP.

[18] Razvan Pascanu, et al. Understanding the exploding gradient problem, 2012, ArXiv.

[19] Fei-Fei Li, et al. Deep visual-semantic alignments for generating image descriptions, 2015, CVPR.

[20] Desmond Elliott, et al. Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description, 2017, WMT.

[21] Peter Young, et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, 2014, TACL.

[22] Yoshua Bengio, et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, 2014, EMNLP.

[23] Philipp Koehn, et al. Neural Machine Translation, 2017, ArXiv.

[24] Frank Keller, et al. Image Pivoting for Learning Multilingual Multimodal Representations, 2017, EMNLP.

[25] Alon Lavie, et al. Meteor Universal: Language Specific Translation Evaluation for Any Target Language, 2014, WMT@ACL.

[26] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.

[27] Rico Sennrich, et al. Nematus: a Toolkit for Neural Machine Translation, 2017, EACL.

[28] Qun Liu, et al. Incorporating Global Visual Features into Attention-based Neural Machine Translation, 2017, EMNLP.

[29] Nick Campbell, et al. Multilingual Multi-modal Embeddings for Natural Language Processing, 2017, ArXiv.

[30] Satoshi Nakamura, et al. NICT-NAIST System for WMT17 Multimodal Translation Task, 2017, WMT.