MoCA: Incorporating Multi-stage Domain Pretraining and Cross-guided Multimodal Attention for Textbook Question Answering

Textbook Question Answering (TQA) is a complex multimodal task that requires inferring answers from long context descriptions and abundant diagrams. Compared with Visual Question Answering (VQA), TQA involves a large number of uncommon terminologies and varied diagram inputs, which challenges the representation capability of language models on domain-specific spans and pushes multimodal fusion to a more complex level. To tackle these issues, we propose a novel model named MoCA, which incorporates multi-stage domain pretraining and multimodal cross attention for the TQA task. First, we introduce a multi-stage domain pretraining module that performs unsupervised post-pretraining with a span masking strategy, followed by supervised pre-finetuning. For domain post-pretraining in particular, we propose a heuristic generation algorithm to exploit the terminology corpus. Second, to fully exploit the rich context and diagram inputs, we propose cross-guided multimodal attention, which updates the features of the text, the question diagram, and the instructional diagram following a progressive strategy. Finally, a dual gating mechanism is adopted to improve the model ensemble. Experimental results show the superiority of our model, which outperforms state-of-the-art methods by 2.21% and 2.43% on the validation and test splits, respectively.
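To make the post-pretraining stage concrete, below is a minimal Python sketch of terminology-aware span masking, in which whole terminology spans are masked as units before falling back to random single-token masking. The function name, the `[MASK]` token, the masking budget, and the greedy longest-term-first matching are illustrative assumptions, not details taken from the paper.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_terminology_spans(tokens, terminology, mask_prob=0.15):
    """Mask whole terminology spans first, then spend any leftover
    budget on random single tokens (hypothetical strategy)."""
    masked = list(tokens)
    budget = max(1, int(len(tokens) * mask_prob))
    # Greedily mask multi-token terminology spans, longest terms first.
    for term in sorted(terminology, key=len, reverse=True):
        term_toks = term.split()
        n = len(term_toks)
        i = 0
        while i <= len(masked) - n and budget > 0:
            if masked[i:i + n] == term_toks:
                masked[i:i + n] = [MASK_TOKEN] * n
                budget -= n
                i += n
            else:
                i += 1
    # Fall back to random token masking for the remaining budget.
    candidates = [i for i, t in enumerate(masked) if t != MASK_TOKEN]
    for i in random.sample(candidates, min(max(budget, 0), len(candidates))):
        masked[i] = MASK_TOKEN
    return masked

# Example: the span "chemical bond" is masked as a unit.
print(mask_terminology_spans(
    "a chemical bond links two atoms".split(),
    ["chemical bond"]))
```

Masking whole domain spans rather than independent subword tokens forces the language model to predict the terminology from context, which is the usual motivation for span-level masking in domain adaptation.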

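And here is a toy PyTorch sketch of the cross-guided multimodal attention, in which the text features first guide the question-diagram features and the updated question-diagram features then guide the instructional-diagram features. The class name, the two-stage wiring, and the sigmoid gates used to blend old and updated features are all assumptions drawn from our reading of the abstract; in particular, the paper's dual gating mechanism operates at the ensemble level and may differ from the per-stage gates shown here.

```python
import torch
import torch.nn as nn

class CrossGuidedAttention(nn.Module):
    """Progressive cross attention over text, question-diagram, and
    instructional-diagram features (hypothetical reconstruction)."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.txt_guides_qd = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.qd_guides_id = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Sigmoid gates decide how much of each guided update to keep.
        self.gate_qd = nn.Linear(2 * dim, dim)
        self.gate_id = nn.Linear(2 * dim, dim)

    def forward(self, text, q_diag, i_diag):
        # Stage 1: text features guide the question-diagram features.
        qd_upd, _ = self.txt_guides_qd(q_diag, text, text)
        g = torch.sigmoid(self.gate_qd(torch.cat([q_diag, qd_upd], dim=-1)))
        q_diag = g * qd_upd + (1 - g) * q_diag
        # Stage 2: the updated question diagram guides the instructional diagram.
        id_upd, _ = self.qd_guides_id(i_diag, q_diag, q_diag)
        g = torch.sigmoid(self.gate_id(torch.cat([i_diag, id_upd], dim=-1)))
        i_diag = g * id_upd + (1 - g) * i_diag
        return text, q_diag, i_diag

# Toy usage: batch of 2, 10 text tokens, 5 regions per diagram.
model = CrossGuidedAttention()
out = model(torch.randn(2, 10, 768),
            torch.randn(2, 5, 768),
            torch.randn(2, 5, 768))
```

For the ensemble, the abstract's dual gating could analogously compute sigmoid weights that blend the predictions of text-only and diagram-aware sub-models, but that wiring is not specified in this section.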