GIT: A Generative Image-to-text Transformer for Vision and Language

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture to one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost performance. Without bells and whistles, our GIT establishes a new state of the art on numerous challenging benchmarks by a large margin. For instance, our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks. Code is released at https://github.com/microsoft/GenerativeImage2Text.
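
To make the described architecture concrete, here is a minimal PyTorch sketch of the GIT design: a single image encoder feeding a single text decoder trained with a plain language-modeling loss, using a seq2seq attention mask in which text tokens attend to all image tokens and to earlier text tokens, while image tokens attend only to each other. Everything here is an illustrative placeholder, not the released model: the conv patch embedder stands in for the paper's contrastively pre-trained vision transformer, the layer counts and vocabulary size are arbitrary, and positional embeddings plus tokenization are omitted for brevity.

```python
import torch
import torch.nn as nn


class GITSketch(nn.Module):
    """Sketch of GIT: one image encoder + one text decoder, LM loss.
    All module choices and sizes are illustrative, not the paper's config."""

    def __init__(self, vocab_size=30522, d_model=768, n_heads=12, n_layers=6):
        super().__init__()
        # Stand-in image encoder: a single conv patch embedder. Any module
        # mapping pixels to a sequence of features could slot in here.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, text_ids):
        # images: (B, 3, H, W); text_ids: (B, T)
        img = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, P, D)
        txt = self.token_embed(text_ids)                           # (B, T, D)
        x = torch.cat([img, txt], dim=1)                           # (B, P+T, D)

        # Seq2seq attention mask (True = blocked): image tokens see only
        # image tokens; each text token sees all image tokens and the text
        # tokens before it.
        P, T = img.size(1), txt.size(1)
        mask = torch.zeros(P + T, P + T, dtype=torch.bool, device=x.device)
        mask[:P, P:] = True                       # image cannot attend to text
        mask[P:, P:] = torch.triu(                # causal mask over text
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        x = self.decoder(x, mask=mask)
        return self.lm_head(x[:, P:])             # logits at text positions


# Training step: next-token cross-entropy over the caption. At inference the
# decoder generates autoregressively, which is also how classification and
# scene text recognition are cast as generation in the paper.
model = GITSketch()
images = torch.randn(2, 3, 224, 224)
caption_ids = torch.randint(0, 30522, (2, 12))
logits = model(images, caption_ids[:, :-1])       # predict token t+1 from <=t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), caption_ids[:, 1:].reshape(-1)
)
```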
