Textbook Question Answering Under Instructor Guidance with Memory Networks

Textbook Question Answering (TQA) is the task of selecting the correct answer to a question by reading a multi-modal context of lengthy essays and images, which makes it a favorable test bed for visual and textual reasoning. However, most current methods are unable to reason over such long contexts together with images. To address this issue, we propose Instructor Guidance with Memory Networks (IGMN), a novel approach that tackles TQA by finding contradictions between the candidate answers and their corresponding context. We build a Contradiction Entity-Relationship Graph (CERG) to extend passage-level multi-modal contradictions to the essay level, so that the machine acts as an instructor that extracts essay-level contradictions as the Guidance. We then exploit memory networks to capture the information in the Guidance and use attention mechanisms to jointly reason over the global features of the multi-modal input. Extensive experiments demonstrate that our method outperforms the state of the art on the TQA dataset. The source code is available at https://github.com/freerailway/igmn.
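To make the memory-network reasoning step more concrete, below is a minimal sketch (not the authors' implementation) of a single attention read over memory slots of the kind the abstract describes: guidance items are stored as memory vectors, and a query encoding the question and a candidate answer attends over them to update its state. All names and the random embeddings here (embed_dim, n_slots, the scoring vector) are illustrative assumptions; the real IGMN uses learned encoders over the CERG-derived Guidance and the multi-modal input.

```python
# Minimal memory-network attention hop, sketched under assumed shapes.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_hop(query, memory_keys, memory_values):
    """One attention read: weight memory slots by similarity to the query."""
    scores = memory_keys @ query        # (n_slots,) dot-product similarities
    weights = softmax(scores)           # attention distribution over guidance slots
    read = weights @ memory_values      # weighted sum of slot value vectors
    return query + read                 # residual update of the query state

# Toy example: 4 guidance slots, 8-dim embeddings, 2 reasoning hops.
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 8))          # stand-in for encoded Guidance keys
values = rng.normal(size=(4, 8))        # stand-in for encoded Guidance values
state = rng.normal(size=8)              # stand-in for question + candidate encoding
for _ in range(2):                      # multi-hop reasoning over the memory
    state = memory_hop(state, keys, values)
score = float(state @ rng.normal(size=8))  # illustrative linear scoring head
print(score)
```

In practice, one such score would be computed per candidate answer and the candidate with the highest score selected; the sketch only shows the attention-read mechanics.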
