SCOUTER: Slot Attention-based Classifier for Explainable Image Recognition

Explainable artificial intelligence is gaining attention. However, most existing methods are based on gradients or intermediate features, which are not directly involved in the decision-making process of the classifier. In this paper, we propose a slot attention-based light-weighted classifier called SCOUTER for transparent yet accurate classification. Two major differences from other attention-based methods include: (a) SCOUTER's explanation involves the final confidence for each category, offering more intuitive interpretation, and (b) all the categories have their corresponding positive or negative explanation, which tells "why the image is of a certain category" or "why the image is not of a certain category." We design a new loss tailored for SCOUTER that controls the model's behavior to switch between positive and negative explanations, as well as the size of explanatory regions. Experimental results show that SCOUTER can give better visual explanations while keeping good accuracy on a large dataset.

[1]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[3]  Chongruo Wu,et al.  ResNeSt: Split-Attention Networks , 2020, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[4]  Theo Gevers,et al.  Con-Text: Text Detection for Fine-Grained Object Classification , 2017, IEEE Transactions on Image Processing.

[5]  Andrea Vedaldi,et al.  Understanding Deep Networks via Extremal Perturbations and Smooth Masks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[6]  Carlos Guestrin,et al.  "Why Should I Trust You?": Explaining the Predictions of Any Classifier , 2016, ArXiv.

[7]  Marcel van Gerven,et al.  Explainable Deep Learning: A Field Guide for the Uninitiated , 2020, ArXiv.

[8]  Fuxin Li,et al.  Visualizing Deep Networks by Optimizing with Integrated Gradients , 2019, CVPR Workshops.

[9]  Trevor Darrell,et al.  Generating Counterfactual Explanations with Natural Language , 2018, ICML 2018.

[10]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[11]  Frank Hutter,et al.  Decoupled Weight Decay Regularization , 2017, ICLR.

[12]  Valery Naranjo,et al.  CNNs for automatic glaucoma assessment using fundus images: an extensive validation , 2019, BioMedical Engineering OnLine.

[13]  Chih-Kuan Yeh,et al.  On the (In)fidelity and Sensitivity for Explanations. , 2019, 1901.09392.

[14]  Zhe L. Lin,et al.  Top-Down Neural Attention by Excitation Backprop , 2016, International Journal of Computer Vision.

[15]  Pietro Perona,et al.  Caltech-UCSD Birds 200 , 2010 .

[16]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[17]  Federico Tombari,et al.  Restricting the Flow: Information Bottlenecks for Attribution , 2020, ICLR.

[18]  Yoshua Bengio,et al.  Object Files and Schemata: Factorizing Declarative and Procedural Knowledge in Dynamical Systems , 2020, ArXiv.

[19]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Harish G. Ramaswamy,et al.  Ablation-CAM: Visual Explanations for Deep Convolutional Network via Gradient-free Localization , 2020, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[21]  Asim Kadav,et al.  Visual Entailment: A Novel Task for Fine-Grained Image Understanding , 2019, ArXiv.

[22]  Bo Chen,et al.  MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.

[23]  Jason Weston,et al.  Tracking the World State with Recurrent Entity Networks , 2016, ICLR.

[24]  Nuno Vasconcelos,et al.  SCOUT: Self-Aware Discriminant Counterfactual Explanations , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Quoc V. Le,et al.  EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks , 2019, ICML.

[26]  Georg Heigold,et al.  Object-Centric Learning with Slot Attention , 2020, NeurIPS.

[27]  Yarin Gal,et al.  Real Time Image Saliency for Black Box Classifiers , 2017, NIPS.

[28]  Abhishek Das,et al.  Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[29]  Andrea Vedaldi,et al.  Interpretable Explanations of Black Boxes by Meaningful Perturbation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[30]  Rakshit Naidu,et al.  SS-CAM: Smoothed Score-CAM for Sharper Visual Feature Localization , 2020, ArXiv.

[31]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[32]  Ziyan Wu,et al.  Counterfactual Visual Explanations , 2019, ICML.

[33]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[34]  Kevin Murphy,et al.  Attention-Based Extraction of Structured Information from Street View Imagery , 2017, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).

[35]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Bolei Zhou,et al.  Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[38]  Quoc V. Le,et al.  SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Mark Chen,et al.  Generative Pretraining From Pixels , 2020, ICML.

[40]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[41]  Yukun Zhu,et al.  Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation , 2020, ECCV.

[42]  Zijian Zhang,et al.  Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[43]  Quanshi Zhang,et al.  Interpreting CNNs via Decision Trees , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Martha Palmer,et al.  Verb Semantics and Lexical Selection , 1994, ACL.

[45]  Nicolas Usunier,et al.  End-to-End Object Detection with Transformers , 2020, ECCV.

[46]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Daniel Omeiza,et al.  Smooth Grad-CAM++: An Enhanced Inference Level Visualization Technique for Deep Convolutional Neural Network Models , 2019, ArXiv.

[48]  Vineeth N. Balasubramanian,et al.  Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks , 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[49]  David Mascharka,et al.  Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[50]  Avanti Shrikumar,et al.  Learning Important Features Through Propagating Activation Differences , 2017, ICML.

[51]  Enhua Wu,et al.  Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52]  Dustin Tran,et al.  Image Transformer , 2018, ICML.

[53]  Kate Saenko,et al.  RISE: Randomized Input Sampling for Explanation of Black-box Models , 2018, BMVC.