An Improved Attention and Hybrid Optimization Technique for Visual Question Answering

In Visual Question Answering (VQA), the attention mechanism plays a critical role in identifying the different objects present in an image, that is, in telling the machine where to focus within the visual content. However, current VQA systems learn the attention distribution from region-based or bounding-box-based image features, which are not expressive enough to answer questions about the foreground objects or the background region of an image. In this paper, we propose a VQA model that uses image features capable of answering questions about both the foreground objects and the background region. We also employ a graph neural network to encode the relationships between image regions and objects, and we generate image captions from this relationship-aware image representation. The proposed model therefore uses two attention modules that exploit each other's knowledge to produce a more effective joint attention, combined with the caption-based image representation, to extract features that can answer questions about the foreground objects and the background region. Finally, the performance of the proposed architecture is further improved by a hybrid simulated annealing-Manta Ray Foraging Optimization (SA-MRFO) algorithm, which selects the optimal weight parameters for the proposed model. The performance of the proposed model is evaluated on two benchmark datasets: VQA 2.0 and VQA-CP v2.
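
The core of the architecture, relation encoding with a graph neural network followed by two question-guided attention modules over the GNN features and the caption-based features, can be pictured with a minimal sketch. The module names, dimensions, single-round message-passing form, and fusion by summation below are illustrative assumptions, not the authors' implementation:

# Minimal sketch (assumed form, not the authors' code) of relation encoding plus
# two cooperating question-guided attention modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationGNN(nn.Module):
    """One round of message passing over region/object nodes."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.upd = nn.Linear(2 * dim, dim)

    def forward(self, nodes, adj):
        # nodes: (N, dim) region/object features, adj: (N, N) relation graph
        messages = adj @ self.msg(nodes)          # aggregate neighbour messages
        return torch.relu(self.upd(torch.cat([nodes, messages], dim=-1)))

class DualAttention(nn.Module):
    """Question-guided attention over (a) GNN node features and (b) caption-based features."""
    def __init__(self, dim):
        super().__init__()
        self.gnn_score = nn.Linear(2 * dim, 1)
        self.cap_score = nn.Linear(2 * dim, 1)

    def attend(self, feats, q, scorer):
        q_exp = q.unsqueeze(0).expand(feats.size(0), -1)
        alpha = F.softmax(scorer(torch.cat([feats, q_exp], dim=-1)), dim=0)
        return (alpha * feats).sum(dim=0)

    def forward(self, gnn_feats, cap_feats, q):
        v1 = self.attend(gnn_feats, q, self.gnn_score)
        v2 = self.attend(cap_feats, q, self.cap_score)
        return v1 + v2                            # fused visual evidence

# Toy usage: 36 region/object nodes, 5 caption-derived features, a 512-d question vector,
# and a placeholder (identity) relation graph.
dim = 512
nodes, captions, q = torch.randn(36, dim), torch.randn(5, dim), torch.randn(dim)
adj = torch.eye(36)
fused = DualAttention(dim)(RelationGNN(dim)(nodes, adj), captions, q)
print(fused.shape)  # torch.Size([512])

In the actual model the fused visual evidence would be combined with the question representation before the answer classifier; the sketch only shows how the two attention modules can share the same question signal over two different views of the image.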

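The hybrid SA-MRFO weight selection can likewise be pictured as manta-ray-style candidate moves filtered by a simulated-annealing acceptance test. The sketch below is an assumed form: the objective val_loss stands in for the validation loss of the VQA model under a candidate weight parameter, and the cyclone-foraging update and cooling schedule are illustrative:

# Minimal sketch (assumed form, not the authors' implementation) of hybrid SA-MRFO
# search for a single weight parameter.
import math
import random

def val_loss(w):
    # placeholder objective: pretend the best weight parameter is 0.7
    return (w - 0.7) ** 2

def sa_mrfo(iters=200, t0=1.0, cooling=0.97, lo=0.0, hi=1.0):
    best = cur = random.uniform(lo, hi)
    temp = t0
    for i in range(iters):
        # manta-ray "cyclone foraging": spiral toward the best solution so far
        r = random.random()
        beta = 2.0 * math.exp(r * (iters - i) / iters) * math.sin(2.0 * math.pi * r)
        cand = best + r * (best - cur) + beta * (best - cur)
        cand = min(max(cand, lo), hi)
        # simulated-annealing acceptance: always keep improvements, and
        # occasionally accept worse candidates while the temperature is high
        delta = val_loss(cand) - val_loss(cur)
        if delta < 0 or random.random() < math.exp(-delta / max(temp, 1e-9)):
            cur = cand
        if val_loss(cur) < val_loss(best):
            best = cur
        temp *= cooling
    return best

print(round(sa_mrfo(), 3))  # settles near 0.7 on this toy objective

On this toy objective the search settles near the assumed optimum; in the paper's setting the retained candidate would instead be the weight parameter that performs best on the VQA 2.0 / VQA-CP v2 validation data.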