TORQUE: A Reading Comprehension Dataset of Temporal Ordering Questions

A critical part of reading is being able to understand the temporal relationships between events described in a passage of text, even when those relationships are not explicitly stated. However, current machine reading comprehension benchmarks have practically no questions that test temporal phenomena, so systems trained on these benchmarks have no capacity to answer questions such as "what happened before/after [some event]?" We introduce TORQUE, a new English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships. Results show that RoBERTa-large achieves an exact-match score of 51% on the test set of TORQUE, about 30 percentage points behind human performance.
