Building a Video-and-Language Dataset with Human Actions for Multimodal Logical Inference

This paper introduces a new video-and-language dataset with human actions for multimodal logical inference. The dataset focuses on intentional and aspectual expressions that describe dynamic human actions, and consists of 200 videos, 5,554 action labels, and 1,942 action triplets of the form (subject, predicate, object), which can be translated straightforwardly into logical semantic representations. The dataset is expected to be useful for evaluating multimodal inference between videos and semantically complex sentences, including those with negation and quantification.
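To illustrate how such triplets map to logical forms, consider a minimal sketch; the example triplet and the role labels Agent and Theme below are assumptions, following a standard neo-Davidsonian event semantics rather than the paper's exact notation. A triplet such as (person, cut, apple) could be rendered as:

∃e ∃x ∃y (person(x) ∧ apple(y) ∧ cut(e) ∧ Agent(e, x) ∧ Theme(e, y))

Here e is an event variable, and the thematic roles link the subject and object of the triplet to the action. Negation over the action, as in "the person does not cut the apple", would then correspond to negating the existential quantifier over events, which is how formula-level representations can support inference with negation and quantification.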
