Multimodal Logical Inference System for Visual-Textual Entailment

A large body of recent research on multimodal inference across text and vision aims to obtain visually grounded word and sentence representations. In this paper, we use logic-based representations as unified meaning representations for texts and images and present an unsupervised multimodal logical inference system that can effectively prove entailment relations between them. We show that by combining semantic parsing and theorem proving, the system can handle semantically complex sentences for visual-textual inference.
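
To make the pipeline concrete, below is a minimal sketch of the theorem-proving step, assuming first-order logical forms have already been produced for the image (via scene-graph translation) and the sentence (via semantic parsing). It uses NLTK's resolution prover as a stand-in for the system's actual prover; the predicates and the lexical axiom are illustrative, not taken from the paper.

```python
# Minimal sketch of visual-textual entailment as theorem proving.
# Assumptions: FOL forms for the image premise and the text hypothesis
# are already available; NLTK's ResolutionProver stands in for the
# system's prover, and the lexical axiom is illustrative.
from nltk.sem import Expression
from nltk.inference import ResolutionProver

read_expr = Expression.fromstring

# Premise: FOL form derived from an image scene graph
# ("a man is riding a horse").
premise = read_expr(r'exists x. exists y. (man(x) & horse(y) & ride(x, y))')

# Hypothesis: FOL form of the sentence "a man is riding an animal",
# obtained by semantic parsing.
hypothesis = read_expr(r'exists x. exists y. (man(x) & animal(y) & ride(x, y))')

# Lexical knowledge axiom bridging the two vocabularies
# (every horse is an animal).
axiom = read_expr(r'all x. (horse(x) -> animal(x))')

# Entailment holds iff the prover derives the hypothesis
# from the premise together with the axiom.
print(ResolutionProver().prove(hypothesis, [premise, axiom]))  # True
```

In this setup, entailment in the other direction (from "animal" to "horse") correctly fails to prove, which is how a logic-based system distinguishes entailment from mere relatedness.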
