VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset