AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia Content Creation

This paper presents the AToMiC (Authoring Tools for Multimedia Content) dataset, designed to advance research in image/text cross-modal retrieval. Although vision-language pretrained transformers have substantially improved retrieval effectiveness, existing research has relied on image-caption datasets that feature only simplistic image-text relationships and underspecified user models of the retrieval task. To bridge the gap between these oversimplified settings and real-world applications in multimedia content creation, we introduce a new approach to building retrieval test collections. We leverage the hierarchical structure of Wikipedia, its texts spanning diverse domains, its varied image styles and types, and its large-scale image-document associations. We formulate two tasks based on a realistic user model and validate the dataset through retrieval experiments with baseline models. AToMiC offers a testbed for scalable, diverse, and reproducible multimedia retrieval research. The dataset provides the basis for a dedicated track at the 2023 Text REtrieval Conference (TREC) and is publicly available at https://github.com/TREC-AToMiC/AToMiC.
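
The abstract describes text-to-image retrieval over Wikipedia sections but does not name the baseline models here. The sketch below illustrates one plausible baseline of this kind: zero-shot dense retrieval with CLIP, scoring candidate images against a section-like text query. The query string and image paths are hypothetical placeholders, not items from AToMiC itself; see the repository linked above for the official data and splits.

```python
# Minimal sketch of a zero-shot text-to-image retrieval baseline with CLIP.
# Assumption: a bi-encoder baseline of this flavor; the paper's actual
# baseline configuration is not specified in this abstract.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical inputs: a section text as the query, a few candidate images.
query = "History of the Golden Gate Bridge"
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]  # placeholder paths
images = [Image.open(p) for p in image_paths]

inputs = processor(
    text=[query], images=images,
    return_tensors="pt", padding=True, truncation=True,
)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (num_queries, num_images): similarity of the
# query against each candidate image, higher is better.
scores = outputs.logits_per_text.squeeze(0)
ranking = scores.argsort(descending=True)
print([image_paths[i] for i in ranking.tolist()])
```

In practice, a run over the full collection would pre-encode all images once with `model.get_image_features`, index the embeddings, and score queries against the index rather than re-encoding images per query as this toy example does.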
