暂无分享,去创建一个
Roozbeh Mottaghi | Ashish Sabharwal | Jialin Wu | Jiasen Lu | Ashish Sabharwal | Jialin Wu | Roozbeh Mottaghi | Jiasen Lu
[1] Kilian Q. Weinberger,et al. BERTScore: Evaluating Text Generation with BERT , 2019, ICLR.
[2] Jianfeng Gao,et al. Unified Vision-Language Pre-Training for Image Captioning and VQA , 2020, AAAI.
[3] François Gardères,et al. ConceptBert: Concept-Aware Representation for Visual Question Answering , 2020, FINDINGS.
[4] Qi Wu,et al. FVQA: Fact-Based Visual Question Answering , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[5] Christopher D. Manning,et al. GQA: a new dataset for compositional question answering over real-world images , 2019, ArXiv.
[6] Ali Farhadi,et al. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Peter Clark,et al. Answering Complex Questions Using Open Information Extraction , 2017, ACL.
[8] Lei Zhang,et al. Bottom-Up and Top-Down Attention for Image Captioning and VQA , 2017, ArXiv.
[9] W. Bruce Croft,et al. Passage Retrieval for Outside-Knowledge Visual Question Answering , 2021, SIGIR.
[10] Jianlong Fu,et al. Learning Rich Image Region Representation for Visual Question Answering , 2019, ArXiv.
[11] Jason Weston,et al. Reading Wikipedia to Answer Open-Domain Questions , 2017, ACL.
[12] Anton van den Hengel,et al. Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge , 2021, LANTERN.
[13] Matthieu Cord,et al. MUREL: Multimodal Relational Reasoning for Visual Question Answering , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[14] Ross B. Girshick,et al. Mask R-CNN , 2017, 1703.06870.
[15] Michael S. Bernstein,et al. Visual7W: Grounded Question Answering in Images , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Wai Lam,et al. AnswerFact: Fact Checking in Product Question Answering , 2020, EMNLP.
[17] Marcus Rohrbach,et al. KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[18] Cho-Jui Hsieh,et al. VisualBERT: A Simple and Performant Baseline for Vision and Language , 2019, ArXiv.
[19] Mohit Bansal,et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers , 2019, EMNLP.
[20] Junyeong Kim,et al. Progressive Attention Memory Network for Movie Story Question Answering , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Zhou Yu,et al. Deep Modular Co-Attention Networks for Visual Question Answering , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[22] Jiasen Lu,et al. Hierarchical Question-Image Co-Attention for Visual Question Answering , 2016, NIPS.
[23] Ming-Wei Chang,et al. Well-Read Students Learn Better: On the Importance of Pre-training Compact Models , 2019 .
[24] Marcus Rohrbach,et al. 12-in-1: Multi-Task Vision and Language Representation Learning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[25] Matthieu Cord,et al. MUTAN: Multimodal Tucker Fusion for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[26] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[27] Xinlei Chen,et al. Towards VQA Models That Can Read , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[28] Xin Wang,et al. Boosting Visual Question Answering with Context-aware Knowledge Aggregation , 2020, ACM Multimedia.
[29] Chunhua Shen,et al. Explicit Knowledge-based Reasoning for Visual Question Answering , 2015, IJCAI.
[30] Percy Liang,et al. Transforming Question Answering Datasets Into Natural Language Inference Datasets , 2018, ArXiv.
[31] Raymond J. Mooney,et al. Improving VQA and its Explanations by Comparing Competing Explanations , 2020, ArXiv.
[32] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[33] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, International Journal of Computer Vision.
[34] Jonathan Tompson,et al. Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation , 2014, NIPS.
[35] Eunsol Choi,et al. Decontextualization: Making Sentences Stand-Alone , 2021, Transactions of the Association for Computational Linguistics.
[36] Colin Raffel,et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res..
[37] Radu Soricut,et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning , 2018, ACL.
[38] Vicente Ordonez,et al. ReferItGame: Referring to Objects in Photographs of Natural Scenes , 2014, EMNLP.
[39] Yu Cheng,et al. UNITER: UNiversal Image-TExt Representation Learning , 2019, ECCV.
[40] Luke S. Zettlemoyer,et al. End-to-end Neural Coreference Resolution , 2017, EMNLP.
[41] Stefan Lee,et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.
[42] Alessandro Moschitti,et al. Joint Models for Answer Verification in Question Answering Systems , 2021, ACL.
[43] Svetlana Lazebnik,et al. Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering , 2018, NeurIPS.
[44] Luke S. Zettlemoyer,et al. AllenNLP: A Deep Semantic Natural Language Processing Platform , 2018, ArXiv.
[45] William W. Cohen,et al. PullNet: Open Domain Question Answering with Iterative Retrieval on Knowledge Bases and Text , 2019, EMNLP.
[46] Alexander G. Schwing,et al. Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering , 2018, ECCV.
[47] Peng Wang,et al. Ask Me Anything: Free-Form Visual Question Answering Based on Knowledge from External Sources , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[48] Byoung-Tak Zhang,et al. Bilinear Attention Networks , 2018, NeurIPS.
[49] Yue Hu,et al. Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering , 2020, IJCAI.
[50] Jeffrey Pennington,et al. GloVe: Global Vectors for Word Representation , 2014, EMNLP.
[51] Nan Duan,et al. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training , 2019, AAAI.
[52] Abhishek Das,et al. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).