Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers

Transformers are increasingly dominating multi-modal reasoning tasks, such as visual question answering, achieving state-of-the-art results thanks to their ability to contextualize information using self-attention and co-attention mechanisms. These attention modules also play a role in other computer vision tasks, including object detection and image segmentation. Unlike Transformers that rely only on self-attention, Transformers with co-attention require considering multiple attention maps in parallel in order to highlight the information in the model's input that is relevant to the prediction. In this work, we propose the first method to explain predictions made by any Transformer-based architecture, including bi-modal Transformers and Transformers with co-attention. We provide generic solutions and apply them to the three most commonly used of these architectures: (i) pure self-attention, (ii) self-attention combined with co-attention, and (iii) encoder-decoder attention. We show that our method is superior to all existing methods, which are adapted from single-modality explainability. Our code is available at: https://github.com/hila-chefer/Transformer-MM-Explainability.
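To make the idea concrete, the sketch below illustrates the kind of rule applied in the pure self-attention case (i): a per-token relevancy map is initialized to the identity and updated layer by layer with gradient-weighted attention, averaged over heads and restricted to positive contributions. The function name `aggregate_relevance`, the tensor shapes, and the way attention maps and gradients are obtained are illustrative assumptions, not the API of the linked repository; the bi-modal (ii) and encoder-decoder (iii) cases additionally maintain cross-modal relevancy maps, as described in the paper.

```python
import torch

def aggregate_relevance(attn_maps, attn_grads):
    """Illustrative sketch (not the authors' code) of relevancy aggregation
    for a pure self-attention Transformer.

    attn_maps, attn_grads: lists of per-layer tensors of shape
    (num_heads, seq_len, seq_len), e.g. captured with forward/backward hooks.
    """
    seq_len = attn_maps[0].shape[-1]
    # Each token starts out fully relevant only to itself.
    relevance = torch.eye(seq_len)
    for attn, grad in zip(attn_maps, attn_grads):
        # Gradient-weighted attention: keep positive contributions, average heads.
        cam = (grad * attn).clamp(min=0).mean(dim=0)
        # Propagate this layer's contribution through the running relevancy map.
        relevance = relevance + cam @ relevance
    # Row i of `relevance` scores how relevant each input token is to token i
    # (e.g. the [CLS] row yields a saliency map over the input tokens).
    return relevance
```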
