Paper Information - On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering

Visual Question Answering (VQA) methods have made incredible progress, but suffer from a failure to generalize. This is visible in the fact that they are vulnerable to learning coincidental correlations in the data rather than deeper relations between image content and ideas expressed in language. We present a dataset that takes a step towards addressing this problem in that it contains questions expressed in two languages, and an evaluation process that co-opts a well understood image-based metric to reflect the method’s ability to reason. Measuring reasoning directly encourages generalization by penalizing answers that are coincidentally correct. The dataset reflects the scene-text version of the VQA problem, and the reasoning evaluation can be seen as a text-based version of a referring expression challenge. Experiments and analyses are provided that show the value of the dataset. The dataset is available at www.est-vqa.org.
