Paper Information - Language-Agnostic Visual-Semantic Embeddings

Language-Agnostic Visual-Semantic Embeddings

This paper proposes a framework for training language-invariant cross-modal retrieval models. We also introduce a novel character-based word-embedding approach that allows the model to project similar words across languages into the same word-embedding space. In addition, performing cross-modal retrieval at the character level substantially decreases the storage requirements of the text encoder, allowing for lighter and more scalable retrieval architectures. The storage requirements of the proposed character-based, language-invariant textual encoder are virtually unaffected when new languages are added to the system. Our contributions include new methods for building character-level word embeddings, an improved loss function, and a novel cross-language alignment module that not only makes the architecture language-invariant but also yields better predictive performance. We show that our models outperform the current state of the art in both single- and multi-language scenarios. This work can be seen as the basis of a new path in retrieval research, allowing for the effective use of captions in multiple-language scenarios. Code is available at https://github.com/jwehrmann/lavse.
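The storage argument in the abstract can be illustrated with a rough back-of-the-envelope sketch. It is not the authors' implementation: the embedding dimension, vocabulary sizes, and alphabets below are hypothetical values chosen only for illustration, and the count covers the lookup table alone, not any encoder layers stacked on top of it.

```python
# Illustrative comparison of lookup-table sizes for word-level versus
# character-level text inputs. All sizes are hypothetical assumptions,
# not figures taken from the paper.

EMBED_DIM = 300  # assumed embedding dimension


def word_table_params(vocab_sizes, dim=EMBED_DIM):
    """Word-level tables: one row per word, for every language."""
    return sum(vocab_sizes.values()) * dim


def char_table_params(alphabets, dim=EMBED_DIM):
    """One shared character table: one row per distinct character."""
    shared_chars = set().union(*alphabets.values())
    return len(shared_chars) * dim


latin = set("abcdefghijklmnopqrstuvwxyz")

# Two languages to start with (hypothetical vocabulary sizes).
vocab_sizes = {"en": 10_000, "de": 12_000}
alphabets = {"en": latin, "de": latin | set("äöüß")}

word_params = word_table_params(vocab_sizes)
char_params = char_table_params(alphabets)

# Adding a third language grows the word tables by a full vocabulary,
# but the shared character table only by the characters not yet seen.
vocab_sizes_3 = {**vocab_sizes, "pt": 11_000}
alphabets_3 = {**alphabets, "pt": latin | set("áâãàçéêíóôõú")}

word_params_3 = word_table_params(vocab_sizes_3)
char_params_3 = char_table_params(alphabets_3)

print(f"word-level: {word_params:,} -> {word_params_3:,} parameters")
print(f"char-level: {char_params:,} -> {char_params_3:,} parameters")
```

Under these assumed sizes, the word-level tables grow by a whole vocabulary for each added language, while the shared character table grows only by the handful of characters it has not seen before.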
