GIT: A Generative Image-to-text Transformer for Vision and Language
Zhe Gan | Kevin Lin | Lijuan Wang | Zicheng Liu | Linjie Li | Jianfeng Wang | Ce Liu | Zhengyuan Yang | Xiaowei Hu