Visual Entailment Task for Visually-Grounded Language Learning

We introduce a new inference task, Visual Entailment (VE), which differs from traditional Textual Entailment (TE) in that the premise is an image rather than a natural language sentence. We propose a novel dataset, SNLI-VE (publicly available at this https URL), for the VE task, built from the Stanford Natural Language Inference (SNLI) corpus and Flickr30k. We also introduce a differentiable architecture, the Explainable Visual Entailment model (EVE), to tackle the VE problem. We evaluate EVE and several state-of-the-art visual question answering (VQA) models on SNLI-VE, facilitating grounded language understanding and providing insight into how modern VQA models perform on this task.
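To make the task format concrete, below is a minimal sketch of what a single VE example might look like, assuming the standard three-way SNLI label set (entailment, neutral, contradiction) and that each hypothesis is paired with the Flickr30k image behind the original SNLI premise caption. The field names and example values are hypothetical, not the official SNLI-VE schema.

```python
# Hypothetical illustration of a Visual Entailment example.
# Field names and values are assumed for clarity; this is not the official SNLI-VE schema.
from dataclasses import dataclass


@dataclass
class VEExample:
    image_id: str     # Flickr30k image serving as the premise
    hypothesis: str   # natural language sentence, inherited from SNLI
    label: str        # one of {"entailment", "neutral", "contradiction"}


example = VEExample(
    image_id="flickr30k/placeholder.jpg",   # hypothetical path
    hypothesis="Two people are outdoors.",
    label="entailment",
)

# A VE model must decide whether the hypothesis is entailed by, contradicted by,
# or neutral with respect to the image premise alone.
print(example.label)
```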
