Explain and improve: LRP-inference fine-tuning for image captioning models

Abstract

This paper analyzes the predictions of image captioning models with attention mechanisms, going beyond visualizing the attention itself. We develop variants of layer-wise relevance propagation (LRP) and gradient-based explanation methods tailored to image captioning models with attention mechanisms, and we systematically compare the interpretability of attention heatmaps against the explanations provided by methods such as LRP, Grad-CAM, and Guided Grad-CAM. We show that these explanation methods provide, for each word in a predicted caption, both pixel-wise image explanations (supporting and opposing pixels of the input image) and linguistic explanations (supporting and opposing words of the preceding sequence). We demonstrate with extensive experiments that explanation methods (1) can reveal additional evidence, beyond attention, that the model uses to make decisions; (2) correlate with object locations with high precision; and (3) are helpful for "debugging" the model, e.g. by analyzing the causes of hallucinated object words. Building on these observed properties, we design an LRP-inference fine-tuning strategy that reduces object hallucination in image captioning models while maintaining sentence fluency. We conduct experiments with two widely used attention mechanisms: adaptive attention, computed with additive attention, and multi-head attention, computed with the scaled dot product.
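
To make the two attention variants concrete, the sketch below contrasts the scaled dot-product scoring used in multi-head attention with the additive (Bahdanau-style) scoring underlying the adaptive attention mechanism. This is a minimal NumPy illustration under our own assumptions: the shapes and the names (`W_h`, `W_f`, `v_a`) are ours, not the paper's, and the adaptive mechanism's visual-sentinel gating is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Scaled dot-product scoring: softmax(q k^T / sqrt(d)) v.
    q: (1, d) query; k, v: (n_regions, d) image-region keys/values."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)        # (1, n_regions)
    alpha = softmax(scores, axis=-1)     # attention weights over regions
    return alpha @ v, alpha              # context vector and weights

def additive_attention(h, features, W_h, W_f, v_a):
    """Additive (Bahdanau-style) scoring: score_i = v_a^T tanh(W_h h + W_f f_i).
    h: (d_h,) decoder state; features: (n_regions, d_f) image features;
    W_h: (d_h, d_a), W_f: (d_f, d_a), v_a: (d_a,)."""
    scores = np.tanh(h @ W_h + features @ W_f) @ v_a   # (n_regions,)
    alpha = softmax(scores, axis=-1)
    return alpha @ features, alpha
```

Both variants produce a weight vector `alpha` over image regions; the attention heatmaps discussed above visualize exactly these weights, which the paper compares against the attributions produced by LRP and gradient-based methods.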
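
The core of LRP is a backward pass that redistributes the relevance of an output decision layer by layer onto the inputs. The following sketch shows the standard epsilon-rule for a single linear layer; propagating through a full attention decoder requires additional, tailored rules (the contribution of this paper), so this is only the basic building block, not the paper's complete procedure.

```python
import numpy as np

def lrp_epsilon_linear(a, W, b, R_out, eps=1e-6):
    """LRP epsilon-rule for a linear layer z = a @ W + b.

    a:     (d_in,)        input activations of the layer
    W:     (d_in, d_out)  weights
    R_out: (d_out,)       relevance arriving from the layer above
    Returns R_in: (d_in,) with
    R_in[j] = sum_k (a[j] * W[j, k]) / (z[k] + eps * sign(z[k])) * R_out[k].
    """
    z = a @ W + b                                 # forward pre-activations
    z = z + eps * np.where(z >= 0, 1.0, -1.0)     # stabilize small denominators
    s = R_out / z                                 # relevance per output unit
    return a * (W @ s)                            # redistribute onto the inputs
```

Applied recursively from the logit of a predicted word down to the CNN input, such rules yield the pixel-wise image explanations; applying them through the word embeddings of the decoder yields the linguistic explanations of the preceding sequence.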
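
Finally, to illustrate how explanations can enter the training loop, here is one plausible shape of the LRP-inference fine-tuning idea: the LRP relevance of the image input for a candidate object word is turned into a score that up- or down-weights that word's prediction before the loss is computed, so that poorly grounded (likely hallucinated) object words receive a weaker training signal. The additive form, the normalization, and all names here are our assumptions for illustration; the exact re-weighting is defined in the paper itself.

```python
import numpy as np

def lrp_inference_reweight(logits, object_ids, relevance, gamma=1.0):
    """Hypothetical relevance-gated re-weighting of object-word logits.

    logits:     (vocab,) decoder logits at the current time step
    object_ids: indices of object words in the vocabulary
    relevance:  (len(object_ids),) normalized LRP relevance of the image
                input for each candidate object word, assumed in [-1, 1]

    Positive relevance (visually grounded) leaves the logit high;
    negative relevance (weak visual evidence) pushes it down.
    """
    logits = logits.copy()
    logits[object_ids] += gamma * relevance   # assumed additive form
    return logits
```

Under this reading, fine-tuning computes the cross-entropy loss on the re-weighted distribution, which is consistent with the abstract's claim that the strategy reduces object hallucination while leaving the rest of the sentence, and hence fluency, largely untouched.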
