Explaining Visual Classification using Attributes

The performance of deep Convolutional Neural Networks (CNNs) has been reaching, and in some cases exceeding, human level on a growing number of tasks, such as image classification, playing the game of Go, and speech understanding. However, their lack of decomposability into intuitive and understandable components makes them hard to interpret: they provide no information about what leads them to a particular prediction. We propose a technique that interprets CNN image classification and justifies the classification result with a visual explanation and a visual search. The model consists of two sub-networks: a deep convolutional network for image analysis and a deep recurrent network that generates a textual justification of the classification decision. To verify the textual justification, we use visual search to retrieve similar content from the training set. We evaluate our approach on the CUB dataset, which provides ground-truth attributes, and we use these attributes to further strengthen the justification for each image.
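A minimal sketch of the two sub-networks described above, assuming a PyTorch implementation with a ResNet-50 image encoder; the abstract does not specify the backbone, framework, or any of the names used here, so JustificationModel, visual_search, the dimensions, and the conditioning scheme are illustrative assumptions rather than the authors' code:

# Hypothetical sketch: CNN classifier + LSTM justification generator + visual search.
import torch
import torch.nn as nn
import torchvision.models as models

class JustificationModel(nn.Module):
    """Two sub-networks: a CNN for image analysis/classification and an LSTM
    that generates a textual justification conditioned on the decision."""
    def __init__(self, num_classes, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Deep convolutional sub-network: image features and class scores.
        backbone = models.resnet50(weights=None)
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()              # keep pooled visual features
        self.cnn = backbone
        self.classifier = nn.Linear(feat_dim, num_classes)
        # Deep recurrent sub-network: justification conditioned on features
        # and the (soft) classification decision.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feat_dim + num_classes, hidden_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.vocab_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.cnn(images)                          # (B, feat_dim)
        logits = self.classifier(feats)                   # classification decision
        cond = torch.cat([feats, logits.softmax(-1)], dim=1)
        h0 = torch.tanh(self.init_h(cond)).unsqueeze(0)   # (1, B, hidden_dim)
        c0 = torch.zeros_like(h0)
        words = self.embed(captions)                      # (B, T, embed_dim) teacher forcing
        out, _ = self.lstm(words, (h0, c0))
        return logits, self.vocab_out(out)                # class scores, per-step word scores

def visual_search(query_feat, train_feats, k=5):
    """Retrieve the k most similar training images by cosine similarity of
    CNN features, used to verify the generated textual justification."""
    sims = nn.functional.cosine_similarity(query_feat.unsqueeze(0), train_feats)
    return sims.topk(k).indices

At inference time one would run the classifier, decode a justification word by word from the LSTM, and call visual_search on the query image's CNN features to retrieve training images whose appearance and attributes support the generated justification.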
