Chitta Baral | Tejas Gokhale | Yezhou Yang | Zhiyuan Fang | Pratyay Banerjee
[1] Juan Carlos Niebles, et al. Dense-Captioning Events in Videos. IEEE International Conference on Computer Vision (ICCV), 2017.
[2] Antonio Torralba, et al. Predicting Motivations of Actions by Leveraging Text. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[3] Alon Lavie, et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
[4] Mohit Bansal, et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. EMNLP, 2019.
[5] Tao Mei, et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[6] Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. IEEE International Conference on Computer Vision (ICCV), 2015.
[7] Shu Kong, et al. Modularized Textual Grounding for Counterfactual Resilience. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[8] Trevor Darrell, et al. Textual Explanations for Self-Driving Vehicles. ECCV, 2018.
[9] Margaret Mitchell, et al. VQA: Visual Question Answering. International Journal of Computer Vision, 2015.
[10] Mark Johnson, et al. An Improved Non-monotonic Transition System for Dependency Parsing. EMNLP, 2015.
[11] Yi Yang, et al. Uncovering the Temporal Context for Video Question Answering. International Journal of Computer Vision, 2017.
[12] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training. OpenAI Technical Report, 2018.
[13] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners. OpenAI Technical Report, 2019.
[14] Yejin Choi, et al. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction. ACL, 2019.
[15] Bhavana Dalvi, et al. Tracking State Changes in Procedural Text: A Challenge Dataset and Models for Process Paragraph Comprehension. NAACL, 2018.
[16] Kate Saenko, et al. Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild. COLING, 2014.
[17] Chitta Baral, et al. Cooking With Blocks: A Recipe for Visual Reasoning on Image-Pairs. CVPR Workshops, 2019.
[18] Juan Carlos Niebles, et al. Procedure Planning in Instructional Videos. ECCV, 2020.
[19] Luowei Zhou, et al. End-to-End Dense Video Captioning with Masked Transformer. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[20] Zhe Gan, et al. Semantic Compositional Networks for Visual Captioning. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[21] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL, 2019.
[22] Heng Tao Shen, et al. Video Captioning With Attention-Based LSTM and Semantic Consistency. IEEE Transactions on Multimedia, 2017.
[23] Jason Weston, et al. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. ICLR, 2015.
[24] Trevor Darrell, et al. Sequence to Sequence -- Video to Text. IEEE International Conference on Computer Vision (ICCV), 2015.
[25] Yejin Choi, et al. ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning. AAAI, 2019.
[26] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation. ACL, 2002.
[27] Christopher D. Manning, et al. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[28] Chitta Baral, et al. Blocksworld Revisited: Learning and Reasoning to Generate Event-Sequences from Image Pairs. arXiv, 2019.
[29] Trevor Darrell, et al. Grounding of Textual Phrases in Images by Reconstruction. ECCV, 2016.
[30] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out (ACL Workshop), 2004.
[31] Louis-Philippe Morency, et al. Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[32] Cordelia Schmid, et al. VideoBERT: A Joint Model for Video and Language Representation Learning. IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[33] Shu Kong, et al. Weakly Supervised Attention Learning for Textual Phrases Grounding. arXiv, 2018.
[34] Yueting Zhuang, et al. Video Question Answering via Gradually Refined Attention over Appearance and Motion. ACM Multimedia, 2017.
[35] Chitta Baral, et al. VQA-LOL: Visual Question Answering under the Lens of Logic. ECCV, 2020.
[36] Doug Downey, et al. Abductive Commonsense Reasoning. ICLR, 2019.
[37] C. Lawrence Zitnick, et al. Learning Common Sense through Visual Abstraction. IEEE International Conference on Computer Vision (ICCV), 2015.
[38] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv, 2019.
[39] Christopher D. Manning, et al. GQA: A New Dataset for Compositional Question Answering over Real-World Images. arXiv, 2019.
[40] Jonathan Krause, et al. A Hierarchical Approach for Generating Descriptive Image Paragraphs. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[41] Sanja Fidler, et al. MovieQA: Understanding Stories in Movies through Question-Answering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[42] Li Fei-Fei, et al. ImageNet: A Large-Scale Hierarchical Image Database. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[43] Larry S. Davis, et al. Understanding Videos, Constructing Plots: Learning a Visually Grounded Storyline Model from Annotated Videos. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[44] Gary Marcus, et al. Deep Learning: A Critical Appraisal. arXiv, 2018.
[45] Li Fei-Fei, et al. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[46] Antonio Torralba, et al. Inferring the Why in Images. arXiv, 2014.
[47] Yejin Choi, et al. PIQA: Reasoning about Physical Commonsense in Natural Language. AAAI, 2019.
[48] Licheng Yu, et al. TVQA: Localized, Compositional Video Question Answering. EMNLP, 2018.
[49] Jian Sun, et al. Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[50] Ali Farhadi, et al. From Recognition to Cognition: Visual Commonsense Reasoning. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[51] Yejin Choi, et al. Social IQa: Commonsense Reasoning about Social Interactions. EMNLP, 2019.
[52] Hermann Ney, et al. LSTM Neural Networks for Language Modeling. INTERSPEECH, 2012.
[53] Grace Hui Yang, et al. VideoQA: Question Answering on News Video. ACM Multimedia, 2003.
[54] Jonathan Berant, et al. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. NAACL, 2019.