Multi-view Visual Question Answering Dataset for Real Environment Applications

In this paper, we propose a novel large-scale Visual Question Answering (VQA) dataset aimed at real-environment applications. Existing VQA datasets either require high construction labor costs or offer only limited power for evaluating the complex scene-understanding abilities involved in VQA tasks. Moreover, most VQA datasets do not address scenes containing object occlusion, which can be crucial for real-world applications. In this work, we propose a synthetic multi-view VQA dataset along with a dataset generation process. We build our dataset from three real object-model datasets. Each scene is observed from multiple virtual cameras, so answering questions often requires multi-view scene understanding. Our dataset requires relatively low labor cost while providing highly complex visual information. In addition, the dataset can be further adapted to users' requirements by extending the generation setup. We evaluated two previous multi-view VQA methods on our dataset. The results show that both 3D understanding and appearance understanding are crucial to achieving high performance on our dataset, and that there is still room for future methods to improve. Our dataset offers a possible way to bridge VQA methods developed on CG datasets with real-world applications, such as robot picking tasks.
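To make the multi-view setup concrete, the following is a minimal sketch of how one dataset entry could be organized and how virtual camera poses might be sampled around a scene. The class and function names, the pose parameterization, and the view count are illustrative assumptions, not the authors' actual generation code.

    # Minimal sketch of a multi-view VQA sample and camera-pose sampling.
    # All names and parameters here are illustrative assumptions.
    import random
    from dataclasses import dataclass

    @dataclass
    class CameraPose:
        """A virtual camera placed on a sphere around the scene origin."""
        azimuth_deg: float
        elevation_deg: float
        radius_m: float

    @dataclass
    class MultiViewVQASample:
        """One entry: rendered views of a scene plus question-answer pairs."""
        scene_id: str
        object_ids: list[str]            # object-model instances placed in the scene
        view_paths: list[str]            # one rendered image per virtual camera
        camera_poses: list[CameraPose]
        qa_pairs: list[tuple[str, str]]  # (question, answer)

    def sample_camera_poses(n_views: int, radius_m: float = 1.0) -> list[CameraPose]:
        """Spread n_views cameras at evenly spaced azimuths with slight jitter."""
        poses = []
        for i in range(n_views):
            azimuth = 360.0 * i / n_views + random.uniform(-10.0, 10.0)
            elevation = random.uniform(20.0, 40.0)
            poses.append(CameraPose(azimuth, elevation, radius_m))
        return poses

    if __name__ == "__main__":
        sample = MultiViewVQASample(
            scene_id="scene_0001",
            object_ids=["mug_03", "box_12"],
            view_paths=[f"scene_0001/view_{i}.png" for i in range(4)],
            camera_poses=sample_camera_poses(4),
            qa_pairs=[("How many objects are on the table?", "2")],
        )
        print(len(sample.view_paths), "views,", len(sample.qa_pairs), "QA pair(s)")

The point the sketch illustrates is that each sample ties every view and camera pose of one scene to the same question-answer pairs, so a model may need to aggregate evidence across views, for example when an object is occluded in some viewpoints.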
