Interpreting Vision and Language Generative Models with Semantic Visual Priors

When applied to image-to-text models, interpretability methods often provide token-by-token explanations; that is, they compute a visual explanation for each token of the generated sequence. Such explanations are expensive to compute and cannot comprehensively explain the model's output. As a result, these methods often require some form of approximation, which eventually leads to misleading explanations. We develop a framework based on SHAP that generates comprehensive, meaningful explanations by leveraging the meaning representation of the output sequence as a whole. Moreover, by exploiting semantic priors in the visual backbone, we extract an arbitrary number of features, which allows the efficient computation of Shapley values on large-scale models while generating highly meaningful visual explanations. We demonstrate that our method generates semantically more expressive explanations than traditional methods at a lower compute cost and that it can be generalized to other explainability methods.
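To make the role of the number of features concrete, the following is a minimal sketch of exact Shapley value computation over a small set of visual features. It is not the framework's implementation: the function `shapley_values`, the three-region setup, and the toy `value_fn` are all hypothetical. In a sentence-level setting, `value_fn` would be assumed to score how close the caption generated from an image showing only the regions in `S` is to the caption of the full image; here it is replaced by a hand-written set function so the example runs on its own.

```python
from itertools import combinations
from math import factorial


def shapley_values(num_features, value_fn):
    """Exact Shapley values for a set function value_fn(frozenset) -> float.

    Enumerates every coalition, so the cost grows as 2**num_features.
    This is why a small number of semantically meaningful regions
    matters for tractability.
    """
    n = num_features
    phi = [0.0] * n
    for i in range(n):
        others = [p for p in range(n) if p != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def value_fn(visible):
    """Toy stand-in (an assumption) for a sentence-level similarity score.

    Region 0 and region 1 contribute on their own; regions 0 and 2
    contribute an extra amount only when both are visible.
    """
    score = 0.0
    if 0 in visible:
        score += 0.5
    if 1 in visible:
        score += 0.2
    if 0 in visible and 2 in visible:
        score += 0.3
    return score


phi = shapley_values(3, value_fn)
print(phi)
```

The attributions sum to the difference between the score of the full image and the score of the fully masked image (the efficiency property of Shapley values), and the interaction term between regions 0 and 2 is split equally between them.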
