Ensembling Visual Explanations

Many machine learning systems deployed for real-world applications such as recommender systems, image captioning, and object detection are ensembles of multiple models, and the top-ranked systems in many data-mining and computer vision competitions also use ensembles. Although ensembles are popular, they are opaque and hard to interpret. Explanations make AI systems more transparent and justify their predictions; however, there has been little work on generating explanations for ensembles. In this chapter, we propose two new methods for ensembling visual explanations for visual question answering (VQA) using the localization maps of the component systems. Our approach scales with the number of component models in the ensemble. Evaluating explanations is itself a challenging research problem, so we also introduce two new evaluation approaches: the comparison metric and the uncovering metric. Our crowd-sourced human evaluation indicates that our ensemble visual explanations qualitatively outperform each individual system's visual explanations by a significant margin. Overall, our ensemble explanation is judged better than any individual system's explanation 61% of the time, and it is sufficient for humans to arrive at the correct answer, based on the explanation alone, at least 64% of the time.
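The summary above does not spell out the ensembling methods themselves, so the snippet below is only a minimal sketch of the general idea: it assumes the ensemble explanation is formed as a confidence-weighted average of the component models' localization maps. The function name `ensemble_localization_maps`, the per-map rescaling, and the use of per-model answer confidences as weights are illustrative assumptions, not the chapter's exact procedure.

```python
import numpy as np

def ensemble_localization_maps(maps, weights=None):
    """Hypothetical sketch: combine per-model localization maps into a
    single ensemble explanation via a weighted average.

    maps    : list of H x W arrays, one localization map per component
              model, assumed to be spatially aligned.
    weights : optional per-model weights (e.g. each model's confidence
              in its answer); defaults to a uniform average.
    """
    # Stack maps into a (K, H, W) array, rescaling each to [0, 1] so
    # maps produced on different scales remain comparable.
    rescaled = []
    for m in maps:
        m = np.asarray(m, dtype=np.float64)
        m = m - m.min()
        if m.max() > 0:
            m = m / m.max()
        rescaled.append(m)
    stacked = np.stack(rescaled)

    if weights is None:
        weights = np.ones(len(maps))
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()  # normalize weights to sum to 1

    # Weighted average over the model axis: (K,) x (K, H, W) -> (H, W).
    return np.tensordot(w, stacked, axes=1)

# Usage with three (randomly generated) 14 x 14 attention maps and
# made-up confidence weights:
m1, m2, m3 = (np.random.rand(14, 14) for _ in range(3))
ensemble_map = ensemble_localization_maps([m1, m2, m3], weights=[0.9, 0.7, 0.4])
```

Because each additional model contributes only one more map and one more weight, a combination of this form grows linearly with the number of component systems, consistent with the scalability claim above.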
