Visual Question Answering: Datasets, Methods, Challenges and Opportunities

Recent advances in machine perception have motivated researchers to tackle more complex problems in this area. One of the most challenging and interesting of these, which has recently attracted attention from various fields of AI, is Visual Question Answering (VQA). In this task, the input is an image together with an open-ended question about the image, and the output is an open-ended answer to that question with respect to the image. Achieving good results therefore requires expertise in several fields, most notably computer vision and natural language processing. Various datasets have been collected for the VQA task, and several methods have been proposed in the literature in recent years. In this survey, we describe the most widely used datasets as well as the state-of-the-art methods for VQA, compare the results of different methods, and identify promising approaches for future research in this area. Identifying the shortcomings of current methods, as well as future opportunities for VQA, is also part of this survey.
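To make the task's input/output structure concrete, the following is a minimal sketch of a generic joint-embedding VQA baseline in PyTorch: a small CNN encodes the image, an LSTM encodes the question, the two representations are fused, and a classifier scores a fixed answer vocabulary. This is an illustrative assumption rather than the method of any specific paper; the class name SimpleVQA, the toy CNN, and all dimensions are hypothetical placeholders.

    import torch
    import torch.nn as nn

    class SimpleVQA(nn.Module):
        def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=512):
            super().__init__()
            # Image encoder: a toy CNN standing in for a pretrained backbone.
            self.cnn = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, hidden_dim),
            )
            # Question encoder: word embeddings followed by an LSTM.
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            # Classifier over a fixed set of candidate answers.
            self.classifier = nn.Linear(hidden_dim, num_answers)

        def forward(self, image, question_tokens):
            img_feat = self.cnn(image)                         # (B, hidden_dim)
            _, (h_n, _) = self.lstm(self.embed(question_tokens))
            q_feat = h_n[-1]                                   # (B, hidden_dim)
            fused = img_feat * q_feat                          # element-wise fusion
            return self.classifier(fused)                      # answer scores

    # Example usage (hypothetical sizes):
    #   model = SimpleVQA(vocab_size=10000, num_answers=3000)
    #   logits = model(torch.randn(8, 3, 224, 224), torch.randint(0, 10000, (8, 14)))

Many of the methods surveyed here can be viewed as refinements of this pipeline, for example replacing the element-wise fusion with attention over image regions or with more expressive multimodal pooling.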
