Answering Questions in Natural Language About Images Using Deep Learning

Visual Question Answering (VQA) combines problems from several fields, including Natural Language Processing, Computer Vision and knowledge representation. The task gives a computer an image and a natural language question as input; the system must process them together and output an accurate answer to the question in the context of the image. The answer may be a single word, a phrase or a sentence, depending on the question and the image. We survey the approaches that research teams around the world have taken to this problem, along with the specifications of the publicly available dataset, in order to analyze the feasibility and scope of the domain. The technology can help blind people recognize objects using voice commands. It may also help physicians and medical practitioners confirm or validate their diagnoses from medical imagery. Because the field is relatively new, there is considerable room for progress in datasets, algorithms and achievable accuracy. We aim to map these possibilities and draw conclusions about the field's further growth.
