Deriving Explanation of Deep Visual Saliency Models

Deep neural networks have achieved human-level performance in visual saliency prediction. However, it is still unclear how they learn the task and what this implies about our understanding of the human visual system. In this work, we develop a technique to derive explainable saliency models from their corresponding deep-neural-architecture-based saliency models by applying human perception theories and conventional concepts of saliency. This technique helps us understand what the deep network learns at its intermediate layers through their activation maps. We first consider two state-of-the-art deep saliency models, UNISAL and MSI-Net, for interpretation. We use a set of biologically plausible log-Gabor filters to identify and reconstruct their activation maps with our explainable saliency model, and the final saliency map is generated from these reconstructed activation maps. We also build our own deep saliency model, a cross-concatenated multi-scale residual block based network (CMRNet), for saliency prediction. We then evaluate the explainable models derived from UNISAL, MSI-Net and CMRNet on three benchmark datasets and compare them with other state-of-the-art methods. Since this approach to explainability can be applied to any deep visual saliency model, we propose it as a generic interpretation technique.
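
To make the role of the filter bank concrete, the Python sketch below builds a standard frequency-domain log-Gabor filter bank over several scales and orientations and applies it to an image via the FFT. It is an illustrative sketch only: the function names (log_gabor_bank, filter_responses) and parameter values (sigma_on_f, number of scales/orientations, etc.) are assumptions of this example and are not the exact filters or settings used in the paper.

import numpy as np

def log_gabor_bank(size, num_scales=4, num_orients=6,
                   min_wavelength=3.0, mult=2.0,
                   sigma_on_f=0.55, d_theta_sigma=1.5):
    """Build a bank of log-Gabor filters defined in the frequency domain.

    Returns an array of shape (num_scales, num_orients, size, size) holding
    the frequency-domain transfer function of each filter.
    """
    # Normalised frequency coordinates in roughly [-0.5, 0.5)
    y, x = np.mgrid[-size // 2:size // 2, -size // 2:size // 2] / float(size)
    radius = np.sqrt(x ** 2 + y ** 2)
    radius[size // 2, size // 2] = 1.0          # avoid log(0) at the DC term
    theta = np.arctan2(-y, x)

    filters = np.zeros((num_scales, num_orients, size, size))
    for s in range(num_scales):
        wavelength = min_wavelength * mult ** s
        f0 = 1.0 / wavelength                    # centre frequency of this scale
        # Radial component: Gaussian on a logarithmic frequency axis
        radial = np.exp(-(np.log(radius / f0) ** 2) /
                        (2 * np.log(sigma_on_f) ** 2))
        radial[size // 2, size // 2] = 0.0       # zero response at DC
        for o in range(num_orients):
            angle = o * np.pi / num_orients
            # Angular component: Gaussian in orientation, with wrap-around
            d_theta = np.arctan2(np.sin(theta - angle), np.cos(theta - angle))
            sigma_theta = d_theta_sigma * np.pi / num_orients
            angular = np.exp(-(d_theta ** 2) / (2 * sigma_theta ** 2))
            filters[s, o] = radial * angular
    return filters

def filter_responses(image, filters):
    """Apply every filter to a 2-D image by multiplication in the frequency domain."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    responses = np.fft.ifft2(
        np.fft.ifftshift(spectrum * filters, axes=(-2, -1)), axes=(-2, -1))
    return np.abs(responses)   # shape: (num_scales, num_orients, H, W)

In an analysis of the kind described above, an intermediate activation map (resized to the filter-bank resolution) would take the place of the input image, and the multi-scale, multi-orientation responses would serve as the basis for reconstructing that activation map.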
