VisualCOMET: Reasoning About the Dynamic Context of a Still Image
暂无分享,去创建一个
Yejin Choi | Ali Farhadi | Roozbeh Mottaghi | Chandra Bhagavatula | Jae Sung Park | Yejin Choi | Chandra Bhagavatula | Ali Farhadi | J. S. Park | Roozbeh Mottaghi
[1] Yann LeCun,et al. Deep multi-scale video prediction beyond mean square error , 2015, ICLR.
[2] C. Lawrence Zitnick,et al. CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[3] Silvio Savarese,et al. Social LSTM: Human Trajectory Prediction in Crowded Spaces , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[4] Tamara L. Berg,et al. Temporal Perception and Prediction in Ego-Centric Video , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[5] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[6] Bernt Schiele,et al. Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[7] Marc'Aurelio Ranzato,et al. Video (language) modeling: a baseline for generative models of natural videos , 2014, ArXiv.
[8] Yejin Choi,et al. ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning , 2019, AAAI.
[9] Ruben Villegas,et al. High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks , 2019, NeurIPS.
[10] Doug Downey,et al. Abductive Commonsense Reasoning , 2019, ICLR.
[11] Silvio Savarese,et al. A Hierarchical Representation for Future Action Prediction , 2014, ECCV.
[12] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[13] Chin-Yew Lin,et al. ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.
[14] Ali Farhadi,et al. Defending Against Neural Fake News , 2019, NeurIPS.
[15] Li Fei-Fei,et al. DenseCap: Fully Convolutional Localization Networks for Dense Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Yejin Choi,et al. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction , 2019, ACL.
[17] Martial Hebert,et al. Patch to the Future: Unsupervised Visual Prediction , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[18] Vicente Ordonez,et al. ReferItGame: Referring to Objects in Photographs of Natural Scenes , 2014, EMNLP.
[19] Furu Wei,et al. VL-BERT: Pre-training of Generic Visual-Linguistic Representations , 2019, ICLR.
[20] Antonio Torralba,et al. Generating Videos with Scene Dynamics , 2016, NIPS.
[21] Alon Lavie,et al. Meteor Universal: Language Specific Translation Evaluation for Any Target Language , 2014, WMT@ACL.
[22] C. Lawrence Zitnick,et al. Learning Common Sense through Visual Abstraction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[23] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[24] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, International Journal of Computer Vision.
[25] Radu Soricut,et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning , 2018, ACL.
[26] Jianfeng Gao,et al. Unified Vision-Language Pre-Training for Image Captioning and VQA , 2020, AAAI.
[27] Xiaodong Liu,et al. Unified Language Model Pre-training for Natural Language Understanding and Generation , 2019, NeurIPS.
[28] Cordelia Schmid,et al. Relational Action Forecasting , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[29] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[30] Mohit Bansal,et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers , 2019, EMNLP.
[31] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.
[32] Svetlana Lazebnik,et al. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[33] Aaron C. Courville,et al. Improved Conditional VRNNs for Video Prediction , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[34] Ali Farhadi,et al. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[35] Ilya Sutskever,et al. Language Models are Unsupervised Multitask Learners , 2019 .
[36] Nitish Srivastava,et al. Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.
[37] Ali Farhadi,et al. From Recognition to Cognition: Visual Commonsense Reasoning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[38] Martial Hebert,et al. The Pose Knows: Video Forecasting by Generating Pose Futures , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[39] Ross B. Girshick,et al. Mask R-CNN , 2017, 1703.06870.
[40] Alan L. Yuille,et al. Generation and Comprehension of Unambiguous Object Descriptions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[41] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[42] Jiajun Wu,et al. Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks , 2016, NIPS.
[43] Stefan Lee,et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.
[44] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.
[45] Lei Zhang,et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[46] Ali Farhadi,et al. "What Happens If..." Learning to Predict the Effect of Forces in Images , 2016, ECCV.
[47] Yejin Choi,et al. The Curious Case of Neural Text Degeneration , 2019, ICLR.
[48] Ashwin K. Vijayakumar,et al. Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models , 2016, ArXiv.
[49] José M. F. Moura,et al. Visual Dialog , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[50] Scott Cohen,et al. Forecasting Human Dynamics from Static Images , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[51] Li Fei-Fei,et al. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[52] Antonio Torralba,et al. Inferring the Why in Images , 2014, ArXiv.
[53] Licheng Yu,et al. UNITER: Learning UNiversal Image-TExt Representations , 2019, ArXiv.
[54] Jitendra Malik,et al. Recurrent Network Models for Human Dynamics , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[55] Samy Bengio,et al. Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[56] Xinlei Chen,et al. Microsoft COCO Captions: Data Collection and Evaluation Server , 2015, ArXiv.