iReason: Multimodal Commonsense Reasoning using Videos and Natural Language with Interpretability

Causality knowledge is vital to building robust AI systems. Deep learning models often perform poorly on tasks that require causal reasoning, which is often derived using some form of commonsense knowledge not immediately available in the input but implicitly inferred by humans. Prior work has unraveled spurious observational biases that models fall prey to in the absence of causality. While language representation models preserve contextual knowledge within learned embeddings, they do not factor in causal relationships during training. By blending causal relationships with the input features to an existing model that performs visual cognition tasks (such as scene understanding, video captioning, video questionanswering, etc.), better performance can be achieved owing to the insight causal relationships bring about. Recently, several models have been proposed that have tackled the task of mining causal data from either the visual or textual modality. However, there does not exist widespread prevalent research that mines causal relationships by juxtaposing the visual and language modalities. While images offer a rich and easy-to-process resource for us to mine causality knowledge from, videos are denser and consist of naturally time-ordered events. Also, textual information offers details that could be implicit in videos. As such, we propose iReason, a framework that infers visual-semantic commonsense knowledge using both videos and natural language captions. Furthermore, iReason’s architecture integrates a causal rationalization module to aid the process of interpretability, error analysis and bias detection. We demonstrate the effectiveness of iReason using a two-pronged comparative analysis with language representation learning models (BERT, GPT-2) as well as current state-of-the-art multimodal causality models. Finally, we present case-studies attesting to the universal applicability of iReason by incorporating the “causal signal” in a range of downstream cognition tasks such as dense video captioning, video question-answering and scene understanding and show that iReason outperforms the state-of-the-art.

[1]  Susanne Westphal,et al.  The “Something Something” Video Database for Learning and Evaluating Visual Common Sense , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[2]  Fergal Cotter,et al.  Probabilistic Future Prediction for Video Scene Understanding , 2020, ECCV.

[3]  Juan Carlos Niebles,et al.  Spatio-Temporal Graph for Video Captioning With Knowledge Distillation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Stefan Lee,et al.  ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.

[5]  Pietro Perona,et al.  Visual Causal Feature Learning , 2014, UAI.

[6]  Li Fei-Fei,et al.  Reasoning about Object Affordances in a Knowledge Base Representation , 2014, ECCV.

[7]  Joyce Yue Chai,et al.  Commonsense Reasoning for Natural Language Understanding: A Survey of Benchmarks, Resources, and Approaches , 2019, ArXiv.

[8]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Yejin Choi,et al.  Commonsense Reasoning for Natural Language Processing , 2020, ACL.

[10]  Wu Liu,et al.  T-C3D: Temporal Convolutional 3D Network for Real-Time Action Recognition , 2018, AAAI.

[11]  Hanwang Zhang,et al.  Two Causal Principles for Improving Visual Dialog , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Mohit Bansal,et al.  LXMERT: Learning Cross-Modality Encoder Representations from Transformers , 2019, EMNLP.

[13]  C. Lawrence Zitnick,et al.  Learning Common Sense through Visual Abstraction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[14]  Hanwang Zhang,et al.  More Grounded Image Captioning by Distilling Image-Text Matching Model , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Danfei Xu,et al.  Scene Graph Generation by Iterative Message Passing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Vlado Menkovski,et al.  Causal Discovery from Incomplete Data: A Deep Learning Approach , 2020, ArXiv.

[17]  Juan Carlos Niebles,et al.  Leveraging Video Descriptions to Learn Video Question Answering , 2016, AAAI.

[18]  Wei Liu,et al.  Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19]  Luowei Zhou,et al.  End-to-End Dense Video Captioning with Masked Transformer , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20]  Mélanie Frappier,et al.  The Book of Why: The New Science of Cause and Effect , 2018, Science.

[21]  Aman Chadha,et al.  iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering , 2020, ArXiv.

[22]  Trevor Darrell,et al.  Women also Snowboard: Overcoming Bias in Captioning Models , 2018, ECCV.

[23]  Alexandros G. Dimakis,et al.  CausalGAN: Learning Causal Implicit Generative Models with Adversarial Training , 2017, ICLR.

[24]  知秀 柴田 5分で分かる!? 有名論文ナナメ読み:Jacob Devlin et al. : BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding , 2020 .

[25]  J. Pearl,et al.  Causal Inference in Statistics: A Primer , 2016 .

[26]  In So Kweon,et al.  Video Panoptic Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Cordelia Schmid,et al.  VideoBERT: A Joint Model for Video and Language Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[28]  Larry S. Davis,et al.  Explicit Bias Discovery in Visual Question Answering Models , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Ilya Sutskever,et al.  Language Models are Unsupervised Multitask Learners , 2019 .

[30]  Jianqiang Huang,et al.  Unbiased Scene Graph Generation From Biased Training , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  J. Pearl Interpretation and Identification of Causal Mediation , 2013, Psychological methods.

[32]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[33]  Hongming Zhang,et al.  Learning Contextual Causality from Time-consecutive Images , 2020, ArXiv.

[34]  Trevor Darrell,et al.  Learning to Reason: End-to-End Module Networks for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[35]  Ali Farhadi,et al.  Stating the Obvious: Extracting Visual Common Sense Knowledge , 2016, NAACL.

[36]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[37]  Xiao Lin,et al.  Don't just listen, use your imagination: Leveraging visual common sense for non-visual tasks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Silvio Savarese,et al.  Causal Induction from Visual Observations for Goal Directed Tasks , 2019, ArXiv.

[39]  Bernhard Scholkopf Causality for Machine Learning , 2019 .

[40]  Yutaka Kidawara,et al.  Toward Future Scenario Generation: Extracting Event Causality Exploiting Semantic Relation, Context, and Association Features , 2014, ACL.

[41]  Chitta Baral,et al.  From Images to Sentences through Scene Description Graphs using Commonsense Reasoning and Knowledge , 2015, ArXiv.

[42]  Hao Wu,et al.  Joint Reasoning for Temporal and Causal Relations , 2018, ACL.

[43]  Hanwang Zhang,et al.  Visual Commonsense R-CNN , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Yu Cheng,et al.  UNITER: UNiversal Image-TExt Representation Learning , 2019, ECCV.

[45]  Richard Socher,et al.  Explain Yourself! Leveraging Language Models for Commonsense Reasoning , 2019, ACL.

[46]  Bernhard Schölkopf,et al.  Discovering Causal Signals in Images , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Ben Goertzel,et al.  Artificial General Intelligence: Concept, State of the Art, and Future Prospects , 2009, J. Artif. Gen. Intell..

[48]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[49]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[50]  Chitta Baral,et al.  Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning , 2020, EMNLP.

[51]  Luke S. Zettlemoyer,et al.  Deep Contextualized Word Representations , 2018, NAACL.

[52]  Yiming Yang,et al.  Transformer-XL: Attentive Language Models beyond a Fixed-Length Context , 2019, ACL.

[53]  Jong-Hoon Oh,et al.  Why-Question Answering using Intra- and Inter-Sentential Causal Relations , 2013, ACL.

[54]  Xinlei Chen,et al.  Towards VQA Models That Can Read , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).