CUNI System for the WMT18 Multimodal Translation Task

We present our submission to the WMT18 Multimodal Translation Task. The main feature of our submission is the use of a self-attentive network in place of a recurrent neural network. We evaluate two methods of incorporating the visual features into the model: first, we include the image representation as an additional input to the network; second, we train the model to predict the visual features and use this prediction as an auxiliary objective. For our submission, we also acquired additional textual and multimodal training data. Both of the proposed methods yield significant improvements over recurrent networks and self-attentive textual baselines.
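Below is a minimal PyTorch sketch of the second idea, the auxiliary visual-feature prediction objective, assuming a Transformer-style encoder whose states and translation logits are already computed elsewhere. The class and function names, the mean-pooling choice, the MSE regression loss, and all dimensions (`d_model=512`, `feat_dim=2048`, `aux_weight=0.1`) are illustrative assumptions, not details taken from the paper or its code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualFeaturePredictor(nn.Module):
    """Auxiliary head (hypothetical): regress a global image feature vector
    from mean-pooled encoder states."""

    def __init__(self, d_model=512, feat_dim=2048):
        super().__init__()
        self.proj = nn.Linear(d_model, feat_dim)

    def forward(self, encoder_states):
        # encoder_states: (batch, src_len, d_model)
        pooled = encoder_states.mean(dim=1)   # (batch, d_model)
        return self.proj(pooled)              # (batch, feat_dim)


def combined_loss(translation_logits, target_ids, encoder_states,
                  image_feats, visual_head, aux_weight=0.1, pad_id=0):
    """Translation cross-entropy plus a weighted visual-feature regression term.

    translation_logits: (batch, tgt_len, vocab)
    target_ids:         (batch, tgt_len)
    image_feats:        (batch, feat_dim) precomputed CNN image features
    """
    trans_loss = F.cross_entropy(
        translation_logits.reshape(-1, translation_logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=pad_id)
    aux_loss = F.mse_loss(visual_head(encoder_states), image_feats)
    return trans_loss + aux_weight * aux_loss
```

At inference time the auxiliary head would simply be ignored, so the model can translate without an image; the first method (feeding the image representation as an extra input) would instead require the image features at both training and test time.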
