Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models

Multimodal machine translation (MMT) systems have been shown to outperform their text-only neural machine translation (NMT) counterparts when visual context is available. However, recent studies have also shown that the performance of MMT models is only marginally affected when the associated image is replaced with an unrelated image or with noise, suggesting that the visual context might not be exploited by the model at all. We hypothesize that this is caused by the nature of the commonly used evaluation benchmark, Multi30K, in which the translations of image captions were produced without showing the images to the human translators. In this paper, we present a qualitative study that examines the role of datasets in encouraging the use of the visual modality, and we propose methods that emphasize the visual signal in the data and demonstrably increase the models' reliance on the source images. Our findings suggest that research on effective MMT architectures is currently hampered by the lack of suitable datasets, and that careful consideration must be given to the creation of future MMT datasets, for which we also provide useful insights.
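As an illustration of the sanity check described above, the sketch below (not the authors' code; feature shapes and the stand-in array are assumptions) builds "incongruent" and "noise" visual inputs for an MMT test set, so that any score drop relative to the original features can be attributed to the model's use of the image.

```python
# Minimal sketch of the image-perturbation sanity check: pair each source
# sentence with a wrong image or with noise, then decode and score each
# condition with the same MMT checkpoint. Only the data construction is shown.
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: one precomputed global feature vector per test sentence,
# e.g. 2048-d ResNet pooled features stored as an (N, 2048) array.
features = rng.normal(size=(1000, 2048)).astype(np.float32)  # placeholder stand-in

def shuffle_features(feats: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Pair every sentence with a different sentence's image (a derangement)."""
    n = len(feats)
    perm = rng.permutation(n)
    # Re-sample until no index maps to itself, so no caption keeps its own image.
    while np.any(perm == np.arange(n)):
        perm = rng.permutation(n)
    return feats[perm]

def noise_features(feats: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Replace every image with Gaussian noise matching the feature statistics."""
    return rng.normal(loc=feats.mean(), scale=feats.std(),
                      size=feats.shape).astype(feats.dtype)

incongruent = shuffle_features(features, rng)
noisy = noise_features(features, rng)

# Each condition would then be decoded with the same checkpoint and scored
# (e.g. with sacrebleu); a negligible gap between conditions suggests the
# model is not actually relying on the visual input.
```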
