Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs

Natural language rationales could provide intuitive, higher-level explanations that are easily understandable by humans, complementing the more broadly studied lower-level explanations based on gradients or attention weights. We present the first study focused on generating natural language rationales across several complex visual reasoning tasks: visual commonsense reasoning, visual-textual entailment, and visual question answering. The key challenge of accurate rationalization is comprehensive image understanding at all levels: not just explicit content at the pixel level, but also contextual content at the semantic and pragmatic levels. We present Rationale^VT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs. Our experiments show that the base pretrained language model benefits from visual adaptation, and that free-text rationalization is a promising research direction to complement model interpretability for complex visual-textual reasoning tasks.
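The abstract does not spell out how the visual inputs are fused with the language model, so the snippet below is only a minimal sketch of one plausible reading: detector-style region features (e.g., from an object recognizer) are linearly projected into a pretrained GPT-2 model's embedding space and prepended as a visual prefix to the tokenized context before generating a rationale. The class name, feature dimensionality, and prefix-fusion strategy are illustrative assumptions, not the authors' published architecture.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class VisuallyAdaptedRationaleGenerator(nn.Module):
    """Hypothetical sketch: condition a pretrained LM on visual features by
    projecting them into the LM embedding space and using them as a prefix."""

    def __init__(self, visual_dim: int = 2048, lm_name: str = "gpt2"):
        super().__init__()
        self.lm = GPT2LMHeadModel.from_pretrained(lm_name)
        embed_dim = self.lm.config.n_embd
        # Map region/frame/graph features into the token-embedding space.
        self.visual_proj = nn.Linear(visual_dim, embed_dim)

    def forward(self, visual_feats, input_ids, labels=None):
        # visual_feats: (batch, n_regions, visual_dim) visual features
        # input_ids:    (batch, seq_len) tokenized context + rationale
        vis_embeds = self.visual_proj(visual_feats)
        txt_embeds = self.lm.transformer.wte(input_ids)
        inputs_embeds = torch.cat([vis_embeds, txt_embeds], dim=1)
        if labels is not None:
            # Ignore the visual prefix positions in the LM loss (-100).
            prefix_ignore = torch.full(visual_feats.shape[:2], -100,
                                       dtype=labels.dtype,
                                       device=labels.device)
            labels = torch.cat([prefix_ignore, labels], dim=1)
        return self.lm(inputs_embeds=inputs_embeds, labels=labels)
```

Under this reading, rationale generation would proceed by standard left-to-right decoding conditioned on the visual prefix plus the textual task context (e.g., question and answer); the same interface could, in principle, accept features derived from semantic frames or commonsense graphs in place of raw object features.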
