ESTER: A Machine Reading Comprehension Dataset for Event Semantic Relation Reasoning

Understanding how events are semantically related to each other is the essence of reading comprehension. Recent event-centric reading comprehension datasets focus mostly on event arguments or temporal relations. While these tasks partially evaluate a machine's ability to understand narratives, human-like reading comprehension requires processing event-based information beyond arguments and temporal reasoning. For example, to understand causality between events, we need to infer motivation or purpose; to establish event hierarchy, we need to understand the composition of events. To facilitate these tasks, we introduce ESTER, a comprehensive machine reading comprehension (MRC) dataset for Event Semantic Relation Reasoning. The dataset uses natural language queries to reason about the five most common event semantic relations, provides more than 6K questions, and captures 10.1K event relation pairs. Experimental results show that current SOTA systems achieve 22.1%, 63.3% and 83.5% on token-based exact-match (EM), F1 and event-based HIT@1 scores, respectively, all significantly below human performance (36.0%, 79.6% and 100%, respectively), which makes the dataset a challenging benchmark.
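To make the token-based scores concrete, the following is a minimal sketch of exact match and token-level F1 between one predicted answer and one reference answer. It assumes the common SQuAD-style definitions (lowercasing, whitespace tokenization, bag-of-tokens overlap); the function names are illustrative, and this is not ESTER's official scorer, whose normalization and handling of multiple answer spans may differ.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; a real scorer may
    # also strip punctuation and articles.
    return text.lower().split()


def exact_match(prediction, reference):
    # 1.0 only if the two token sequences are identical.
    return float(tokenize(prediction) == tokenize(reference))


def token_f1(prediction, reference):
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it occurs in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a prediction that drops one of three reference tokens fails exact match but still receives partial F1 credit, which is why the two token-based scores reported above can differ widely.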
