AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia Content Creation

This paper presents the AToMiC (Authoring Tools for Multimedia Content) dataset, designed to advance research in image/text cross-modal retrieval. Although vision-language pretrained transformers have substantially improved retrieval effectiveness, existing research has relied on image-caption datasets that feature only simplistic image-text relationships and underspecified user models of the retrieval task. To bridge the gap between these oversimplified settings and real-world applications in multimedia content creation, we introduce a new approach to building retrieval test collections. We leverage the hierarchical structure of Wikipedia, its texts spanning diverse domains, its varied image styles and types, and its large-scale image-document associations. We formulate two tasks based on a realistic user model and validate the dataset through retrieval experiments with baseline models. AToMiC offers a testbed for scalable, diverse, and reproducible multimedia retrieval research. The dataset provides the basis for a dedicated track at the 2023 Text REtrieval Conference (TREC) and is publicly available at https://github.com/TREC-AToMiC/AToMiC.
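
The abstract describes text-to-image retrieval over Wikipedia sections but does not name the baseline models here. The sketch below illustrates one plausible baseline of this kind: zero-shot dense retrieval with CLIP, scoring candidate images against a section-like text query. The query string and image paths are hypothetical placeholders, not items from AToMiC itself; see the repository linked above for the official data and splits.

```python
# Minimal sketch of a zero-shot text-to-image retrieval baseline with CLIP.
# Assumption: a bi-encoder baseline of this flavor; the paper's actual
# baseline configuration is not specified in this abstract.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical inputs: a section text as the query, a few candidate images.
query = "History of the Golden Gate Bridge"
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]  # placeholder paths
images = [Image.open(p) for p in image_paths]

inputs = processor(
    text=[query], images=images,
    return_tensors="pt", padding=True, truncation=True,
)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (num_queries, num_images): similarity of the
# query against each candidate image, higher is better.
scores = outputs.logits_per_text.squeeze(0)
ranking = scores.argsort(descending=True)
print([image_paths[i] for i in ranking.tolist()])
```

In practice, a run over the full collection would pre-encode all images once with `model.get_image_features`, index the embeddings, and score queries against the index rather than re-encoding images per query as this toy example does.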
