GQA-it: Italian Question Answering on Image Scene Graphs

The recent breakthroughs in deep learning have led to state-of-the-art results in several Computer Vision and Natural Language Processing tasks, such as Visual Question Answering (VQA). Nevertheless, the training requirements in cross-linguistic settings are not yet fully met: large-scale datasets suitable for training VQA systems are still unavailable for non-English languages, which represents a significant barrier for most neural methods. This paper explores the possibility of acquiring, in a semi-automatic fashion, a large-scale dataset for VQA in Italian. It consists of more than 1M question-answer pairs over 80k images, with a manually validated test set of 3,000 question-answer pairs. To the best of our knowledge, the models trained on this dataset represent the first attempt to approach VQA in Italian, with experimental results comparable to those obtained on the original English material.
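The semi-automatic acquisition described in the abstract amounts to translating the English question-answer pairs of the source dataset into Italian while keeping each pair grounded to its image. A minimal sketch of that mapping step follows; note that `translate_en_it` is a hypothetical stand-in for a real machine-translation call (e.g. an OPUS-MT English-to-Italian model), and the sample entry is illustrative, not taken from the actual dataset:

```python
# Sketch of the translation pass over English QA pairs.
# `translate_en_it` is a HYPOTHETICAL placeholder for a real MT model
# (e.g. an OPUS-MT en->it model); here it uses a tiny lookup table so
# the sketch is self-contained.

def translate_en_it(text: str) -> str:
    # Placeholder: a real pipeline would invoke a trained MT system.
    glossary = {
        "What color is the car?": "Di che colore è l'auto?",
        "red": "rosso",
    }
    return glossary.get(text, text)

def build_gqa_it(gqa_entries):
    """Map English QA entries to Italian ones, preserving the image id
    so each pair stays grounded in its scene graph."""
    return [
        {
            "image_id": entry["image_id"],
            "question": translate_en_it(entry["question"]),
            "answer": translate_en_it(entry["answer"]),
        }
        for entry in gqa_entries
    ]

# Illustrative input (not a real dataset entry).
sample = [{"image_id": "img_0001",
           "question": "What color is the car?",
           "answer": "red"}]
print(build_gqa_it(sample)[0]["question"])  # Di che colore è l'auto?
```

In the paper's setting, only the test split of the resulting pairs is manually validated; the training portion is kept as machine-translated output.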

[1] Eva Schlinger, et al. How Multilingual is Multilingual BERT?, 2019, ACL.

[2] Anton van den Hengel, et al. Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3] Margaret Mitchell, et al. VQA: Visual Question Answering, 2015, International Journal of Computer Vision.

[4] Mohit Bansal, et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers, 2019, EMNLP.

[5] Dhruv Batra, et al. Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[6] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.

[7] Dan Klein, et al. Neural Module Networks, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Michael S. Bernstein, et al. Visual7W: Grounded Question Answering in Images, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9] Christopher D. Manning, et al. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10] Jörg Tiedemann, et al. OPUS-MT – Building open translation services for the World, 2020, EAMT.

[11] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12] Michael S. Bernstein, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, 2016, International Journal of Computer Vision.

[13] Alexander J. Smola, et al. Stacked Attention Networks for Image Question Answering, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14] Roberto Basili, et al. Enabling deep learning for large scale question answering in Italian, 2019, Intelligenza Artificiale.

[15] Pushpak Bhattacharyya, et al. A Unified Framework for Multilingual and Code-Mixed Visual Question Answering, 2020, AACL.

[16] Li Fei-Fei, et al. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17] Snehasis Mukherjee, et al. Visual Question Answering using Deep Learning: A Survey and Performance Analysis, 2019, ArXiv.

[18] Yash Goyal, et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19] Mario Fritz, et al. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input, 2014, NIPS.

[20] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[21] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.