Paper Information - UNITER: UNiversal Image-TExt Representation Learning

Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Unlike previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for global image-text alignment, we also propose WRA via the use of Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves a new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR$^2$. Code is available at this https URL.
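The conditional masking idea in the abstract (mask one modality while the other stays fully observed) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `conditional_mask`, the 15% masking probability, the 50/50 choice between modalities, and the use of zero vectors for masked regions are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # placeholder token for masked words (assumed name)


def conditional_mask(words, regions, mask_prob=0.15, rng=None):
    """Mask exactly one modality and leave the other fully observed.

    words:   list of token strings
    regions: list of region feature vectors (lists of floats)
    Returns (target, words, regions, masked_indices), where target is
    "text" or "image". All defaults here are illustrative assumptions.
    """
    if rng is None:
        rng = random.Random()
    # Work on copies so the caller's inputs are not modified.
    words = list(words)
    regions = [list(r) for r in regions]

    if rng.random() < 0.5:
        # Masked language modeling step: only words are masked,
        # so every image region remains visible.
        target = "text"
        idx = [i for i in range(len(words)) if rng.random() < mask_prob]
        if not idx:  # make sure at least one position is masked
            idx = [rng.randrange(len(words))]
        for i in idx:
            words[i] = MASK
    else:
        # Masked region modeling step: only regions are masked,
        # so every word remains visible.
        target = "image"
        idx = [i for i in range(len(regions)) if rng.random() < mask_prob]
        if not idx:
            idx = [rng.randrange(len(regions))]
        for i in idx:
            regions[i] = [0.0] * len(regions[i])

    return target, words, regions, idx


if __name__ == "__main__":
    caption = ["a", "dog", "on", "the", "grass"]
    feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    print(conditional_mask(caption, feats, rng=random.Random(0)))
```

By contrast, joint random masking would apply the masking loop to both `words` and `regions` in the same step, so a masked word could coincide with a masked region that depicts it.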
