Semantic Concept Network and Deep Walk-based Visual Question Answering

Visual Question Answering (VQA) is a research hot spot at the intersection of computer vision and natural language processing, and its progress has enabled many high-level applications. This work describes a novel VQA model based on semantic concept network construction and DeepWalk. Extracting semantic representations from images is an effective way to bridge the semantic gap, and recent research has shown that co-occurrence patterns of concepts can further enhance such representations. Our work is motivated by the observation that semantic concepts have complex interrelations and that these relations naturally form a network. We therefore construct a semantic concept network by leveraging Word Activation Forces (WAFs) and mine the co-occurrence patterns of semantic concepts using DeepWalk. The model then performs multinomial logistic regression over the extracted DeepWalk vector together with the visual image feature and the question feature, effectively integrating the visual and semantic features of the image with the natural-language question. Experimental results show that our algorithm outperforms competitive baselines on three benchmark image QA datasets. Furthermore, through experiments on image annotation refinement and semantic analysis on the pre-labeled LabelMe dataset, we verify the effectiveness of the constructed concept network for mining concept co-occurrence patterns, sensible concept clusters, and hierarchies.
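
As a rough illustration of the pipeline sketched above, the following Python snippet builds a toy concept graph from a hypothetical WAF affinity matrix, generates truncated random walks in the DeepWalk style, trains skip-gram embeddings over the walks with gensim, and fits a multinomial logistic regression over the concatenation of a concept (DeepWalk) vector, an image feature, and a question feature. All names, dimensions, and data here (waf, image_feats, question_feats, etc.) are illustrative assumptions, not the paper's actual implementation.

    # Illustrative sketch only: a minimal DeepWalk-over-concept-graph pipeline (assumed setup).
    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # 1. Toy semantic concept network from a (hypothetical) WAF affinity matrix.
    concepts = ["sky", "cloud", "tree", "car", "road", "person"]
    n = len(concepts)
    waf = rng.random((n, n)); np.fill_diagonal(waf, 0.0)           # waf[i, j] ~ activation force i -> j
    trans = {i: waf[i] / waf[i].sum() for i in range(n)}           # row-normalised transition probabilities

    # 2. DeepWalk: truncated random walks + skip-gram embeddings.
    def random_walk(start, length=10):
        walk = [start]
        for _ in range(length - 1):
            walk.append(int(rng.choice(n, p=trans[walk[-1]])))
        return [concepts[i] for i in walk]

    walks = [random_walk(i) for i in range(n) for _ in range(20)]  # 20 walks per node
    dw = Word2Vec(walks, vector_size=16, window=3, min_count=0, sg=1, epochs=20)
    concept_vec = {c: dw.wv[c] for c in concepts}                  # DeepWalk embedding per concept

    # 3. Fuse DeepWalk vector with image and question features, then classify answers.
    num_samples, num_answers = 200, 5
    image_feats = rng.random((num_samples, 32))                    # e.g. pooled CNN features (toy)
    question_feats = rng.random((num_samples, 24))                 # e.g. sentence embedding (toy)
    dw_feats = np.stack([concept_vec[concepts[i % n]] for i in range(num_samples)])
    X = np.hstack([image_feats, question_feats, dw_feats])
    y = rng.integers(0, num_answers, size=num_samples)             # toy answer labels

    clf = LogisticRegression(max_iter=1000)                        # multinomial softmax over answers
    clf.fit(X, y)
    print("toy training accuracy:", clf.score(X, y))

In the actual model, the transition structure would come from the learned WAF network, and the image and question features would come from a trained CNN and question encoder rather than random toy vectors.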
