We present MCTest, a freely available set of stories and associated questions intended for research on the machine comprehension of text. Previous work on machine comprehension (e.g., semantic modeling) has made great strides, but primarily focuses either on limited-domain datasets or on solving a more restricted goal (e.g., open-domain relation extraction). In contrast, MCTest requires machines to answer multiple-choice reading comprehension questions about fictional stories, directly tackling the high-level goal of open-domain machine comprehension. Reading comprehension can test advanced abilities such as causal reasoning and understanding of the world, yet, by being multiple-choice, still provides a clear metric. Because the stories are fictional, the answer typically can be found only in the story itself. The stories and questions are also carefully limited to those a young child would understand, reducing the world knowledge required for the task. We present the scalable crowdsourcing methods that allow us to cheaply construct a dataset of 500 stories and 2000 questions. By screening workers (with grammar tests) and stories (with grading), we ensured that the data is of the same quality as another set that we manually edited, but at one tenth the editing cost. Because it is open-domain yet carefully restricted, we hope MCTest will serve to encourage research and provide a clear metric for advancement on the machine comprehension of text.

1 Reading Comprehension

A major goal for NLP is for machines to understand text as well as people do. Several research disciplines focus on this problem: for example, information extraction, relation extraction, semantic role labeling, and recognizing textual entailment. Yet these techniques are necessarily evaluated individually, rather than by how much they advance us towards the end goal. On the other hand, the goal of semantic parsing is the machine comprehension of text (MCT), yet its evaluation requires adherence to a specific knowledge representation, and it is currently unclear what the best representation is for open-domain text. We believe that it is useful to directly tackle the top-level task of MCT. For this, we need a way to measure progress.

One common method for evaluating someone's understanding of text is to give them a multiple-choice reading comprehension test. This has the advantage of being objectively gradable (unlike essays) while still testing a range of abilities such as causal or counterfactual reasoning, inference among relations, or basic understanding of the world in which the passage is set. Therefore, we propose a multiple-choice reading comprehension task as a way to evaluate progress on MCT.

We have built a reading comprehension dataset containing 500 fictional stories, with 4 multiple-choice questions per story. It was built with methods that can easily scale to at least 5000 stories: the stories were created, and the data was curated, almost entirely through crowdsourcing, at a total cost of $4.00 per story. We plan to periodically update the dataset to ensure that methods are not overfitting to the existing data. The dataset is open-domain, yet restricted to concepts and words that a 7-year-old is expected to understand. This task is still beyond the capability of today's computers and algorithms.
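Since the task is multiple-choice, grading reduces to question-level accuracy. The Python sketch below shows one possible way to represent an MCTest-style example and score a system's predictions; the record types, field names, and the trivial first-choice baseline are illustrative assumptions rather than the dataset's actual file format or an official evaluation script.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical record types for an MCTest-style example; field names are
# illustrative, not the dataset's actual distribution format.
@dataclass
class Question:
    text: str
    choices: List[str]   # the four answer options
    answer: int          # index of the correct option

@dataclass
class Story:
    passage: str
    questions: List[Question]   # each MCTest story has 4 questions

def accuracy(stories: List[Story],
             predict: Callable[[str, str, List[str]], int]) -> float:
    """Fraction of questions answered correctly.

    `predict` maps (passage, question text, choices) to the index of
    the chosen option.
    """
    correct = total = 0
    for story in stories:
        for q in story.questions:
            correct += int(predict(story.passage, q.text, q.choices) == q.answer)
            total += 1
    return correct / total if total else 0.0

if __name__ == "__main__":
    # Toy example with a single story and question, scored with a
    # baseline that always picks the first option.
    demo = [Story(
        passage="James went to the store and bought a red ball.",
        questions=[Question(
            text="What did James buy?",
            choices=["a red ball", "a blue kite", "a book", "a hat"],
            answer=0,
        )],
    )]
    print(accuracy(demo, lambda passage, question, choices: 0))  # prints 1.0
```

Any real system would replace the first-choice baseline with a model that actually reads the passage; the point here is only that the multiple-choice format yields a single, unambiguous accuracy number.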