Exploiting hierarchical visual features for visual question answering

Abstract Visual question answering (VQA) aims to infer an answer given a textual question and an image. Previous VQA approaches use only the highest layer of a Convolutional Neural Network (CNN) as the visual representation, which is biased toward the object classification task. These object-categorization-oriented features lose the low-level semantics needed for attribute-related questions, e.g., color, texture, and the number of instances. Consequently, conventional VQA methods are vulnerable to low-level semantic questions. Low-level layer features, on the other hand, retain these low-level semantics. We therefore suggest that low-level layer features are better suited to low-level semantic questions, and we support this claim with experiments. Furthermore, we propose a novel VQA model named Hierarchical Feature Network (HFnet), which exploits intermediate CNN layers to derive various semantics for VQA. In the answer reasoning stage, each hierarchical feature is combined with the attention map and pooled multimodally so that both high- and low-level semantic questions are considered. Our proposed model outperforms the existing methods. The qualitative experiments also demonstrate that HFnet is superior at inferring attention regions.
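The idea of attending over several CNN levels and fusing each with the question can be illustrated with a minimal NumPy sketch. It is not the HFnet implementation: the function names (`attend`, `fuse_levels`), the feature shapes, the random projections, and the use of an element-wise product as a simple stand-in for multimodal pooling are all assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(feature_map, question_vec, proj):
    """Question-guided attention over one CNN level (hypothetical helper).

    feature_map: (N, C) array, N spatial locations with C channels.
    question_vec: (D,) question embedding.
    proj: (C, D) projection into the question space (assumed, random here).
    """
    projected = feature_map @ proj       # (N, D)
    scores = projected @ question_vec    # (N,) relevance of each location
    weights = softmax(scores)            # attention map over the N locations
    attended = weights @ projected       # (D,) attention-weighted feature
    return attended, weights

def fuse_levels(feature_maps, question_vec, projs):
    """Fuse attended features from every level with the question.

    The element-wise product below is only a stand-in for multimodal pooling.
    """
    fused = np.zeros_like(question_vec)
    attention_maps = []
    for fmap, proj in zip(feature_maps, projs):
        attended, weights = attend(fmap, question_vec, proj)
        fused += attended * question_vec
        attention_maps.append(weights)
    return fused, attention_maps

rng = np.random.default_rng(0)
D = 8
# Hypothetical (locations, channels) for a low, middle, and high CNN layer.
level_shapes = [(196, 16), (49, 32), (49, 64)]
feature_maps = [rng.standard_normal(s) for s in level_shapes]
projs = [rng.standard_normal((c, D)) * 0.1 for _, c in level_shapes]
question_vec = rng.standard_normal(D)

fused, attention_maps = fuse_levels(feature_maps, question_vec, projs)
```

In this sketch each level yields its own attention map, so a low-level layer can focus on color or texture cues while the top layer focuses on objects; the fused vector would then feed an answer classifier, which is omitted here.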
