Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner

Large pre-trained multimodal models have demonstrated significant success in a range of downstream tasks, including image captioning, image-text retrieval, and visual question answering (VQA). However, many of these methods rely on image-text pairs collected from the web as pre-training data and unfortunately overlook the need for fine-grained feature alignment between the vision and language modalities, which requires a detailed understanding of images and language expressions. While integrating VQA and dense captioning (DC) into pre-training can address this issue, acquiring image-question-answer and image-location-caption triplets is challenging and time-consuming. Additionally, publicly available datasets for VQA and dense captioning are typically limited in scale because of the manual data collection and labeling effort involved. In this paper, we propose a novel method called Joint QA and DC GEneration (JADE), which utilizes a pre-trained multimodal model and easily crawled image-text pairs to automatically generate and filter large-scale VQA and dense captioning datasets. We apply this method to the Conceptual Captions (CC3M) dataset to generate a new dataset called CC3M-QA-DC. Experiments show that, when used for pre-training in a multi-task manner, CC3M-QA-DC improves performance with various backbones on a range of downstream tasks. Furthermore, our generated CC3M-QA-DC can be combined with larger image-text datasets (e.g., CC15M) to achieve competitive results compared with models that use far more data. Code and dataset are available at https://github.com/johncaged/OPT_Questioner.
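To make the generate-and-filter idea concrete, below is a minimal sketch of how such a pipeline could be wired together. It is not the authors' released implementation: the Questioner, Answerer, and DenseCaptioner interfaces are hypothetical stand-ins for the jointly learned questioner, an independent VQA model, and the dense captioner, and the agreement-based filter is only one plausible filtering criterion; consult the linked repository for the actual method.

```python
"""Hypothetical sketch of a JADE-style generate-and-filter pipeline.

The model interfaces below are illustrative stand-ins, not the released
code; only the overall generate -> cross-check -> keep flow follows the
abstract above.
"""

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class QAExample:
    image_id: str
    question: str
    answer: str


@dataclass
class DenseCaption:
    image_id: str
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized
    caption: str


class Questioner:
    """Stand-in for the jointly learned questioner (hypothetical API)."""

    def generate(self, image, caption: str) -> List[Tuple[str, str]]:
        # Would return (question, answer) pairs grounded in the web caption.
        return [("what is shown in the image?", caption.split()[0] if caption else "")]


class Answerer:
    """Stand-in for an independent VQA model used to check generated QA."""

    def answer(self, image, question: str) -> str:
        return ""


class DenseCaptioner:
    """Stand-in for the jointly learned dense captioner (hypothetical API)."""

    def generate(self, image) -> List[Tuple[Tuple[float, float, float, float], str]]:
        # Would return (bounding box, region caption) pairs.
        return [((0.0, 0.0, 1.0, 1.0), "an image")]


def build_qa_dc_dataset(pairs, questioner, answerer, dense_captioner):
    """Generate QA and dense-caption annotations from image-text pairs,
    keeping only QA pairs on which an independent answerer agrees
    (one plausible filter; the paper may use a different criterion)."""
    qa_data, dc_data = [], []
    for image_id, image, caption in pairs:
        for question, answer in questioner.generate(image, caption):
            if answerer.answer(image, question).strip().lower() == answer.strip().lower():
                qa_data.append(QAExample(image_id, question, answer))
        for box, region_caption in dense_captioner.generate(image):
            dc_data.append(DenseCaption(image_id, box, region_caption))
    return qa_data, dc_data
```

The same loop would be run over all CC3M image-text pairs to produce the QA and dense-captioning annotations, which are then mixed with the original captions for multi-task pre-training.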
