Visual Explanation by Interpretation: Improving Visual Feedback Capabilities of Deep Neural Networks

Learning-based representations have become the defacto means to address computer vision tasks. Despite their massive adoption, the amount of work aiming at understanding the internal representations learned by these models is rather limited. Existing methods aimed at model interpretation either require exhaustive manual inspection of visualizations, or link internal network activations with external "possibly useful" annotated concepts. We propose an intermediate scheme in which, given a pretrained model, we automatically identify internal features relevant for the set of classes considered by the model, without requiring additional annotations. We interpret the model through average visualizations of these features. Then, at test time, we explain the network prediction by accompanying the predicted class label with supporting heatmap visualizations derived from the identified relevant features. In addition, we propose a method to address the artifacts introduced by strided operations in deconvnet-based visualizations. Our evaluation on the MNIST, ILSVRC 12 and Fashion 144k datasets quantitatively shows that the proposed method is able to identify relevant internal features for the classes of interest while improving the quality of the produced visualizations.

[1]  Philip H. S. Torr,et al.  Multi-agent Diverse Generative Adversarial Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2]  Nassir Navab,et al.  A Taxonomy and Library for Visualizing Learned Features in Convolutional Neural Networks , 2016, ArXiv.

[3]  Yao Li,et al.  Mining Mid-level Visual Patterns with Deep CNN Activations , 2015, International Journal of Computer Vision.

[4]  Andrea Vedaldi,et al.  Net2Vec: Quantifying and Explaining How Concepts are Encoded by Filters in Deep Neural Networks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[6]  Michael P. Friedlander,et al.  Probing the Pareto Frontier for Basis Pursuit Solutions , 2008, SIAM J. Sci. Comput..

[7]  Dumitru Erhan,et al.  The (Un)reliability of saliency methods , 2017, Explainable AI.

[8]  Hod Lipson,et al.  Understanding Neural Networks Through Deep Visualization , 2015, ArXiv.

[9]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Bolei Zhou,et al.  Network Dissection: Quantifying Interpretability of Deep Visual Representations , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Alexei A. Efros,et al.  What makes Paris look like Paris? , 2015, Commun. ACM.

[12]  Jean Ponce,et al.  Sparse Modeling for Image and Vision Processing , 2014, Found. Trends Comput. Graph. Vis..

[13]  Quanshi Zhang,et al.  Interpretable Convolutional Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[15]  Xiaogang Wang,et al.  Crafting GBD-Net for Object Detection , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Davide Modolo,et al.  Do Semantic Parts Emerge in Convolutional Neural Networks? , 2016, International Journal of Computer Vision.

[17]  Bernard Ghanem,et al.  On the relationship between visual attributes and convolutional networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Ilya Kostrikov,et al.  PlaNet - Photo Geolocation with Convolutional Neural Networks , 2016, ECCV.

[19]  Pascal Hitzler,et al.  Relating Input Concepts to Convolutional Neural Network Decisions , 2017, ArXiv.

[20]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2015, CVPR.

[21]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[22]  Luc Van Gool,et al.  An Analysis of Human-Centered Geolocation , 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[23]  Vineeth N. Balasubramanian,et al.  Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks , 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[24]  Alexander Binder,et al.  On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation , 2015, PloS one.

[25]  Francesc Moreno-Noguer,et al.  Neuroaesthetics in fashion: Modeling the perception of fashionability , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Andrea Vedaldi,et al.  MatConvNet: Convolutional Neural Networks for MATLAB , 2014, ACM Multimedia.

[27]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Andrew Zisserman,et al.  Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps , 2013, ICLR.

[29]  Bolei Zhou,et al.  Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Bolei Zhou,et al.  Object Detectors Emerge in Deep Scene CNNs , 2014, ICLR.

[31]  Frank Dellaert,et al.  Dataset fingerprints: Exploring image collections through data mining , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Thomas Brox,et al.  Striving for Simplicity: The All Convolutional Net , 2014, ICLR.

[33]  Yang Zhang,et al.  A Theoretical Explanation for Perplexing Behaviors of Backpropagation-based Visualizations , 2018, ICML.

[34]  Trevor Darrell,et al.  Generating Visual Explanations , 2016, ECCV.

[35]  Luc Van Gool,et al.  Dynamic Filter Networks , 2016, NIPS.

[36]  Andrew Zisserman,et al.  Return of the Devil in the Details: Delving Deep into Convolutional Nets , 2014, BMVC.

[37]  Dhruv Batra,et al.  Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions? , 2016, EMNLP.

[38]  Tinne Tuytelaars,et al.  Modeling Visual Compatibility through Hierarchical Mid-level Elements , 2016, ArXiv.