Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models
S. Ullman | R. Giryes | R. Feris | Roei Herzig | Donghyun Kim | Assaf Arbelle | Sivan Harary | Leonid Karlinsky | Amit Alfassy | Paola Cascante-Bonilla | Sivan Doveh | Rameswar Panda
[1] Trevor Darrell,et al. Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs , 2023, ArXiv.
[2] Haoming Jiang,et al. Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond , 2023, ACM Trans. Knowl. Discov. Data.
[3] Mohamed Elhoseiny,et al. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models , 2023, ArXiv.
[4] Yong Jae Lee,et al. Visual Instruction Tuning , 2023, ArXiv.
[5] Ross B. Girshick,et al. Segment Anything , 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV).
[6] Naman Goyal,et al. LLaMA: Open and Efficient Foundation Language Models , 2023, ArXiv.
[7] S. Savarese,et al. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models , 2023, ICML.
[8] Trevor Darrell,et al. PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data , 2022, 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).
[9] S. Ullman,et al. Teaching Structured Vision & Language Concepts to Vision & Language Models , 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[10] R. Feris,et al. ConStruct-VL: Data-Free Continual Structured VL Concepts Learning , 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[11] James Y. Zou,et al. When and why vision-language models behave like bags-of-words, and what to do about it? , 2022, ICLR.
[12] Kate Saenko,et al. FETA: Towards Specializing Foundation Models for Expert Task Applications , 2022, ArXiv.
[13] Percy Liang,et al. Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning , 2022, ArXiv.
[14] Tiancheng Zhao,et al. VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations , 2022, ArXiv.
[15] Trevor Darrell,et al. Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens , 2022, arXiv:2206.06346.
[16] Ryan A. Rossi,et al. CyCLIP: Cyclic Contrastive Language-Image Pretraining , 2022, NeurIPS.
[17] Xi Victoria Lin,et al. OPT: Open Pre-trained Transformer Language Models , 2022, ArXiv.
[18] Guy Edward Toh Emerson,et al. Visual Spatial Reasoning , 2022, TACL.
[19] Chunhua Shen,et al. PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining , 2022, NeurIPS.
[20] Yong Jae Lee,et al. ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models , 2022, NeurIPS.
[21] Prafulla Dhariwal,et al. Hierarchical Text-Conditional Image Generation with CLIP Latents , 2022, ArXiv.
[22] Tristan Thrush,et al. Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[23] Shalini De Mello,et al. GroupViT: Semantic Segmentation Emerges from Text Supervision , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[24] Trishul M. Chilimbi,et al. Vision-Language Pre-Training with Triple Contrastive Learning , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[25] S. Hoi,et al. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation , 2022, ICML.
[26] Yin Cui,et al. Scaling Open-Vocabulary Image Segmentation with Image-Level Labels , 2021, ECCV.
[27] B. Ommer,et al. High-Resolution Image Synthesis with Latent Diffusion Models , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[28] Alexander S. Ecker,et al. Image Segmentation Using Text and Image Prompts , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[29] Lu Yuan,et al. RegionCLIP: Region-based Language-Image Pretraining , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[30] Daniel Keysers,et al. LiT: Zero-Shot Transfer with Locked-image text Tuning , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[31] Zhenguo Li,et al. FILIP: Fine-grained Interactive Language-Image Pre-Training , 2021, ICLR.
[32] Jenia Jitsev,et al. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs , 2021, ArXiv.
[33] David P. Kreil,et al. CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP , 2021, NeurIPS.
[34] Trevor Darrell,et al. Object-Region Video Transformers , 2021, Computer Vision and Pattern Recognition.
[35] Junjie Yan,et al. Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm , 2021, ICLR.
[36] Yelong Shen,et al. LoRA: Low-Rank Adaptation of Large Language Models , 2021, ICLR.
[37] Scott Cohen,et al. Learning to Predict Visual Attributes in the Wild , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[38] Yin Cui,et al. Open-vocabulary Object Detection via Vision and Language Knowledge Distillation , 2021, ICLR.
[39] Cordelia Schmid,et al. Unified Graph Structured Models for Video Understanding , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[40] Stella Biderman,et al. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow , 2021.
[41] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.
[42] Quoc V. Le,et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision , 2021, ICML.
[43] Wonjae Kim,et al. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision , 2021, ICML.
[44] Shih-Fu Chang,et al. Open-Vocabulary Object Detection Using Captions , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[45] Jonathan Berant,et al. Learning Object Detection from Captions via Textual Scene Attributes , 2020, ArXiv.
[46] Chen Gao,et al. DRG: Dual Relation Graph for Human-Object Interaction Detection , 2020, ECCV.
[47] Trevor Darrell,et al. Compositional Video Synthesis with Action Graphs , 2020, ICML.
[48] Jianfeng Gao,et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks , 2020, ECCV.
[49] Ali Farhadi,et al. Grounded Situation Recognition , 2020, ECCV.
[50] Trevor Darrell,et al. Something-Else: Compositional Action Recognition With Spatial-Temporal Interaction Networks , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[51] Trevor Darrell,et al. Learning Canonical Representations for Scene Graph to Image Generation , 2019, ECCV.
[52] Juan Carlos Niebles,et al. Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[53] Andrew Zisserman,et al. End-to-End Learning of Visual Representations From Uncurated Instructional Videos , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[54] Yu Cheng,et al. UNITER: UNiversal Image-TExt Representation Learning , 2019, ECCV.
[55] Mohit Bansal,et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers , 2019, EMNLP.
[56] Cho-Jui Hsieh,et al. VisualBERT: A Simple and Performant Baseline for Vision and Language , 2019, ArXiv.
[57] Wei Li,et al. Cap2Det: Learning to Amplify Weak Caption Supervision for Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[58] Mohan S. Kankanhalli,et al. Learning to Detect Human-Object Interactions With Knowledge , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[59] Cewu Lu,et al. HAKE: Human Activity Knowledge Engine , 2019, ArXiv.
[60] Jonathan Berant,et al. Differentiable Scene Graphs , 2019, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).
[61] Trevor Darrell,et al. Spatio-Temporal Action Graph Networks , 2018, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).
[62] Yin Li,et al. Compositional Learning for Human Object Interaction , 2018, ECCV.
[63] Radu Soricut,et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning , 2018, ACL.
[64] Andreas Dengel,et al. Introducing Eurosat: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification , 2018, IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium.
[65] Cordelia Schmid,et al. A flexible model for training action localization with varying levels of supervision , 2018, NeurIPS.
[66] Christian Wolf,et al. Object Level Visual Reasoning in Videos , 2018, ECCV.
[67] Abhinav Gupta,et al. Videos as Space-Time Region Graphs , 2018, ECCV.
[68] Razvan Pascanu,et al. Relational inductive biases, deep learning, and graph networks , 2018, ArXiv.
[69] Li Fei-Fei,et al. Image Generation from Scene Graphs , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[70] Michael S. Bernstein,et al. Referring Relationships , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[71] Jonathan Berant,et al. Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction , 2018, NeurIPS.
[72] Ivan Laptev,et al. Learning from Video and Text via Large-Scale Discriminative Clustering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[73] Danfei Xu,et al. Scene Graph Generation by Iterative Message Passing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[74] Guy Cazuguel,et al. Multiple-Instance Learning for Medical Image and Video Analysis , 2017, IEEE Reviews in Biomedical Engineering.
[75] Michael S. Bernstein,et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.
[76] N. Rajpoot,et al. Locality Sensitive Deep Learning for Detection and Classification of Nuclei in Routine Colon Cancer Histology Images , 2016, IEEE Trans. Medical Imaging.
[77] Bolei Zhou,et al. Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[78] Ivan Laptev,et al. Is object localization for free? - Weakly-supervised learning with convolutional neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[79] Svetlana Lazebnik,et al. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, International Journal of Computer Vision.
[80] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.
[81] Cordelia Schmid,et al. Finding Actors and Actions in Movies , 2013, 2013 IEEE International Conference on Computer Vision.
[82] Greg Mori,et al. Similarity Constrained Latent Support Vector Machine: An Application to Weakly Supervised Action Classification , 2012, ECCV.
[83] Yang Song,et al. Handling label noise in video classification via multiple instance learning , 2011, 2011 International Conference on Computer Vision.
[84] Alex Krizhevsky,et al. Learning Multiple Layers of Features from Tiny Images , 2009.