Visual Explanations from Hadamard Product in Multimodal Deep Networks

The visual explanation of learned representation of models helps to understand the fundamentals of learning. The attentional models of previous works used to visualize the attended regions over an image or text using their learned weights to confirm their intended mechanism. Kim et al. (2016) show that the Hadamard product in multimodal deep networks, which is well-known for the joint function of visual question answering tasks, implicitly performs an attentional mechanism for visual inputs. In this work, we extend their work to show that the Hadamard product in multimodal deep networks performs not only for visual inputs but also for textual inputs simultaneously using the proposed gradient-based visualization technique. The attentional effect of Hadamard product is visualized for both visual and textual inputs by analyzing the two inputs and an output of the Hadamard product with the proposed method and compared with learned attentional weights of a visual question answering model.

[1]  Anton van den Hengel,et al.  Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2]  Abhishek Das,et al.  Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[3]  Jiasen Lu,et al.  Hierarchical Question-Image Co-Attention for Visual Question Answering , 2016, NIPS.

[4]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[5]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[6]  Sanja Fidler,et al.  Skip-Thought Vectors , 2015, NIPS.

[7]  Ying Zhang,et al.  On Multiplicative Integration with Recurrent Neural Networks , 2016, NIPS.

[8]  Byoung-Tak Zhang,et al.  Multimodal Residual Learning for Visual QA , 2016, NIPS.

[9]  Bolei Zhou,et al.  Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Ramprasaath R. Selvaraju,et al.  Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization , 2016 .

[11]  Trevor Darrell,et al.  Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding , 2016, EMNLP.

[12]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Thomas Brox,et al.  Striving for Simplicity: The All Convolutional Net , 2014, ICLR.

[14]  Jung-Woo Ha,et al.  Dual Attention Networks for Multimodal Reasoning and Matching , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).