EDIS: Entity-Driven Image Search over Multimodal Web Content

Making image retrieval methods practical for real-world search applications requires significant progress in dataset scale, entity comprehension, and multimodal information fusion. In this work, we introduce \textbf{E}ntity-\textbf{D}riven \textbf{I}mage \textbf{S}earch (EDIS), a challenging dataset for cross-modal image search in the news domain. EDIS consists of 1 million web images drawn from actual search engine results and curated datasets, with each image paired with a textual description. Unlike datasets that assume a small set of single-modality candidates, EDIS reflects real-world web image search by providing a million multimodal image-text pairs as candidates. EDIS thus encourages the development of retrieval models that jointly address cross-modal information fusion and matching. To produce accurate rankings, a model must: 1) understand named entities and events in text queries, 2) ground those entities in images or text descriptions, and 3) effectively fuse textual and visual representations. Our experimental results show that EDIS challenges state-of-the-art methods with its dense entities and large-scale candidate set. Our ablation study further shows that fusing textual features with visual features is critical for improving retrieval performance.
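To make the retrieval setup concrete, the sketch below illustrates one generic way to rank multimodal candidates (image plus textual description) against a text query with late fusion of visual and textual features. This is not the EDIS baseline: the embeddings are random placeholders standing in for real image/text encoders (e.g., a CLIP-style dual encoder), and the fusion weight `alpha` is a hypothetical hyperparameter introduced only for illustration.

```python
import torch
import torch.nn.functional as F

def fuse(image_emb: torch.Tensor, caption_emb: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Late fusion: weighted sum of L2-normalized image and caption embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    return F.normalize(alpha * image_emb + (1.0 - alpha) * caption_emb, dim=-1)

def rank_candidates(query_emb: torch.Tensor,
                    image_embs: torch.Tensor,
                    caption_embs: torch.Tensor,
                    alpha: float = 0.5,
                    top_k: int = 5):
    """Rank multimodal candidates by cosine similarity to the query embedding."""
    query_emb = F.normalize(query_emb, dim=-1)
    candidate_embs = fuse(image_embs, caption_embs, alpha)  # (N, d)
    scores = candidate_embs @ query_emb                     # (N,) cosine similarities
    return torch.topk(scores, k=top_k)

if __name__ == "__main__":
    d, n = 512, 1000                   # embedding dim, number of candidates
    query_emb = torch.randn(d)         # placeholder for an encoded text query
    image_embs = torch.randn(n, d)     # placeholder image-encoder outputs
    caption_embs = torch.randn(n, d)   # placeholder caption-encoder outputs
    scores, indices = rank_candidates(query_emb, image_embs, caption_embs)
    print("top candidates:", indices.tolist())
```

In practice the placeholder embeddings would come from trained encoders, and setting `alpha` to 1 or 0 recovers image-only or text-only retrieval, which is the kind of comparison the ablation on feature fusion addresses.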
