XTQA: Span-Level Explanations of the Textbook Question Answering

Textbook Question Answering (TQA) is a task in which one must answer diagram and non-diagram questions given a large multi-modal context consisting of abundant essays and diagrams. We argue that explainability in this task should place students at its center. To address this issue, we devise a novel architecture for span-level eXplanations of TQA (XTQA) based on our proposed coarse-to-fine algorithm, which provides students not only with answers but also with the span-level evidence for choosing them. The algorithm first coarsely selects the top $M$ paragraphs relevant to a question using TF-IDF, and then finely selects the top $K$ evidence spans from all candidate spans within these paragraphs by computing the information gain of each span with respect to the question. Experimental results show that XTQA significantly improves performance over the state-of-the-art baselines. The source code is available at this https URL
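
To make the coarse-to-fine procedure concrete, here is a minimal sketch, not the authors' implementation: the coarse step ranks paragraphs by TF-IDF similarity to the question, and the fine step scores candidate spans inside the top-$M$ paragraphs. The fixed sentence-window span enumeration and the `span_information_gain` function are assumptions; the latter is a hypothetical lexical-overlap placeholder standing in for the paper's information-gain scorer.

```python
# Minimal sketch of coarse-to-fine evidence selection (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def top_m_paragraphs(question, paragraphs, m=3):
    """Coarse step: rank paragraphs against the question with TF-IDF."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([question] + paragraphs)
    scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
    ranked = scores.argsort()[::-1][:m]
    return [paragraphs[i] for i in ranked]


def candidate_spans(paragraph, window=2):
    """Assumption: candidate spans are contiguous sentence windows."""
    sentences = [s.strip() for s in paragraph.split(".") if s.strip()]
    return [" . ".join(sentences[i:i + window])
            for i in range(max(1, len(sentences) - window + 1))]


def span_information_gain(question, span):
    """Hypothetical placeholder: lexical overlap between question and span.
    The paper computes the information gain of each span to the question;
    this stub only marks where that computation plugs in."""
    q, s = set(question.lower().split()), set(span.lower().split())
    return len(q & s) / (len(q) + 1e-9)


def top_k_spans(question, paragraphs, m=3, k=2):
    """Fine step: score every candidate span in the top-M paragraphs."""
    spans = [sp for p in top_m_paragraphs(question, paragraphs, m)
             for sp in candidate_spans(p)]
    spans.sort(key=lambda sp: span_information_gain(question, sp), reverse=True)
    return spans[:k]
```

Usage would follow the task setup: pass the question text and the textbook lesson's paragraphs to `top_k_spans`, and the returned spans serve as the span-level evidence shown to students alongside the predicted answer.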
