Cross-Modal Coherence for Text-to-Image Retrieval

Common image–text joint understanding techniques presume that images and their associated text can universally be characterized by a single implicit model. However, co-occurring images and text can be related in qualitatively different ways, and explicitly modeling these relations could improve the performance of current joint understanding models. In this paper, we train a Cross-Modal Coherence Model for the text-to-image retrieval task. Our analysis shows that models trained with image–text coherence relations retrieve the images originally paired with the target text more often than coherence-agnostic models do. We also show via human evaluation that images retrieved by the proposed coherence-aware model are preferred over those from a coherence-agnostic baseline by a wide margin. Our findings provide insights into the ways that different modalities communicate and into the role of coherence relations in capturing commonsense inferences in text and imagery.
