Validation Methodology for Expert-Annotated Datasets: Event Annotation Case Study

Event detection remains a difficult task due to the complexity and ambiguity of events as linguistic entities. On the one hand, inter-annotator agreement among experts annotating events is low, despite the multitude of existing annotation guidelines and their numerous revisions. On the other hand, event extraction systems achieve lower F1-scores than extractors for other entity types, such as people or locations. In this paper we study the consistency and completeness of expert-annotated datasets for events and time expressions, and we propose a data-agnostic methodology for validating such datasets along these two dimensions. Furthermore, we combine the power of crowds and machines to correct and extend expert-annotated event datasets. We show the benefit of using crowd-annotated events to train and evaluate a state-of-the-art event extraction system: our results show that crowd-annotated events increase the system's performance by at least 5.3%.
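To make the two validation dimensions concrete, the minimal sketch below compares expert event spans against crowd-annotated spans: expert spans unconfirmed by the crowd flag potential consistency problems, while crowd spans absent from the expert set are completeness candidates. The span format, the overlap-based matching rule, and all function names are assumptions introduced for this illustration; the paper's actual quality metrics are not reproduced here.

```python
# Illustrative sketch only: the annotation format (doc_id, start, end) and the
# overlap-based matching rule are assumptions, not the paper's exact procedure.
from typing import Set, Tuple

Span = Tuple[str, int, int]  # (doc_id, character start offset, character end offset)

def overlaps(a: Span, b: Span) -> bool:
    """Two spans match if they belong to the same document and overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def validate(expert: Set[Span], crowd: Set[Span]):
    """Return (consistency flags, completeness candidates):
    expert spans not confirmed by any crowd span, and
    crowd spans with no counterpart in the expert annotations."""
    unconfirmed = {e for e in expert if not any(overlaps(e, c) for c in crowd)}
    missing = {c for c in crowd if not any(overlaps(c, e) for e in expert)}
    return unconfirmed, missing

if __name__ == "__main__":
    expert = {("doc1", 10, 17), ("doc1", 40, 48)}
    crowd = {("doc1", 12, 17), ("doc1", 60, 66)}
    unconfirmed, missing = validate(expert, crowd)
    print("Expert spans not confirmed by the crowd:", unconfirmed)  # {("doc1", 40, 48)}
    print("Crowd spans missing from the expert set:", missing)      # {("doc1", 60, 66)}
```

In this toy run, the expert span ("doc1", 40, 48) would be flagged for re-examination, and the crowd span ("doc1", 60, 66) would be a candidate for extending the expert dataset.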
