Object-Based Reasoning in VQA

Visual Question Answering (VQA) is a novel problem domain where multi-modal inputs must be processed in order to solve the task given in the form of a natural language. As the solutions inherently require to combine visual and natural language processing with abstract reasoning, the problem is considered as AI-complete. Recent advances indicate that using high-level, abstract facts extracted from the inputs might facilitate reasoning. Following that direction we decided to develop a solution combining state-of-the-art object detection and reasoning modules. The results, achieved on the well-balanced CLEVR dataset, confirm the promises and show significant, few percent improvements of accuracy on the complex "counting" task.

[1]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[2]  Richard S. Zemel,et al.  Image Question Answering: A Visual Semantic Embedding Model and a New Dataset , 2015, ArXiv.

[3]  Peng Wang,et al.  Ask Me Anything: Free-Form Visual Question Answering Based on Knowledge from External Sources , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Ole Winther,et al.  Recurrent Relational Networks for Complex Relational Reasoning , 2018, ArXiv.

[5]  Jason Weston,et al.  Memory Networks , 2014, ICLR.

[6]  Li Fei-Fei,et al.  CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Saurabh Singh,et al.  Where to Look: Focus Regions for Visual Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Richard Socher,et al.  Interpretable Counting for Visual Question Answering , 2017, ICLR.

[9]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[10]  Yong Jae Lee,et al.  Weakly-Supervised Visual Grounding of Phrases with Linguistic Structures , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Alexander J. Smola,et al.  Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Qi Wu,et al.  Visual Question Answering: A Tutorial , 2017, IEEE Signal Processing Magazine.

[13]  Lei Zhang,et al.  Bottom-Up and Top-Down Attention for Image Captioning and VQA , 2017, ArXiv.

[14]  Bo Dai,et al.  Detecting Visual Relationships with Deep Relational Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Dhruv Batra,et al.  Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions? , 2016, EMNLP.

[16]  Trevor Darrell,et al.  Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding , 2016, EMNLP.

[17]  Derek Hoiem,et al.  Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[18]  Kate Saenko,et al.  Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering , 2015, ECCV.

[19]  Dan Klein,et al.  Neural Module Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Li Fei-Fei,et al.  Inferring and Executing Programs for Visual Reasoning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[21]  Lei Zhang,et al.  Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[23]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Martín Abadi,et al.  TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[25]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[26]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Jiasen Lu,et al.  Hierarchical Question-Image Co-Attention for Visual Question Answering , 2016, NIPS.

[28]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[29]  Dan Klein,et al.  Deep Compositional Question Answering with Neural Module Networks , 2015, ArXiv.

[30]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[31]  Alex Graves,et al.  Neural Turing Machines , 2014, ArXiv.

[32]  Tat-Seng Chua,et al.  Neural Collaborative Filtering , 2017, WWW.

[33]  Sergio Guadarrama,et al.  Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Mario Fritz,et al.  Ask Your Neurons: A Deep Learning Approach to Visual Question Answering , 2016, International Journal of Computer Vision.

[35]  Qi Wu,et al.  The VQA-Machine: Learning How to Use Existing Vision Algorithms to Answer New Questions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Yash Goyal,et al.  Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[38]  Razvan Pascanu,et al.  A simple neural network module for relational reasoning , 2017, NIPS.

[39]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[40]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[41]  Lukasz Kaiser,et al.  One Model To Learn Them All , 2017, ArXiv.

[42]  Qi Wu,et al.  Visual question answering: A survey of methods and datasets , 2016, Comput. Vis. Image Underst..

[43]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[44]  Mario Fritz,et al.  Towards a Visual Turing Challenge , 2014, ArXiv.

[45]  Licheng Yu,et al.  Visual Madlibs: Fill in the Blank Description Generation and Question Answering , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[46]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Alex Graves,et al.  Adaptive Computation Time for Recurrent Neural Networks , 2016, ArXiv.

[48]  Mario Fritz,et al.  A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input , 2014, NIPS.

[49]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..