Violin: A Large-Scale Dataset for Video-and-Language Inference
Jingzhou Liu | Wenhu Chen | Zhe Gan | Licheng Yu | Jingjing Liu | Yiming Yang | Yu Cheng