Harnessing Diversity in Crowds and Machines for Better NER Performance

In recent years, information extraction tools have gained great popularity and brought significant performance improvements in extracting meaning from structured and unstructured data. For example, named entity recognition (NER) tools identify types such as people, organizations, or places in text. However, despite their high F1 scores, NER tools remain brittle because of their highly specialized and constrained input and training data. As a result, each tool can extract only a subset of the named entities (NEs) mentioned in a given text. To improve NE coverage, we propose a hybrid approach: we first aggregate the output of various NER tools and then validate and extend it through crowdsourcing. Our experimental results show that this approach performs significantly better than the individual state-of-the-art tools (including existing tools that already integrate individual outputs). Furthermore, we show that the crowd is quite effective both in (1) identifying mistakes, inconsistencies, and ambiguities in currently used ground truth, and in (2) providing a promising way to gather ground truth annotations for NER that capture a multitude of opinions.
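The abstract does not specify the aggregation mechanics, so the following is a minimal sketch of the first step only, assuming each tool emits span-level (start, end, type) annotations over the same text; the tool names, the agreement threshold, and the `aggregate_ner_outputs` helper are hypothetical illustrations, not the authors' implementation. Taking the union of spans (rather than their intersection) is what serves the coverage goal, while the per-span agreement score indicates which spans a pipeline might route to crowd validation.

```python
from collections import defaultdict

def aggregate_ner_outputs(tool_outputs):
    """Merge span-level NE annotations from several NER tools.

    `tool_outputs` maps a tool name to a list of (start, end, etype)
    tuples over the same text. Returns each distinct span together
    with the tools and types that support it, so low-agreement spans
    can be routed to crowd validation.
    """
    support = defaultdict(lambda: {"tools": set(), "types": set()})
    for tool, spans in tool_outputs.items():
        for start, end, etype in spans:
            entry = support[(start, end)]
            entry["tools"].add(tool)
            entry["types"].add(etype)
    return support

# Hypothetical outputs of three NER tools over the same sentence.
outputs = {
    "tool_a": [(0, 12, "PERSON"), (25, 31, "ORG")],
    "tool_b": [(0, 12, "PERSON")],
    "tool_c": [(0, 12, "PER"), (40, 46, "LOC")],
}

merged = aggregate_ner_outputs(outputs)
for span, info in sorted(merged.items()):
    agreement = len(info["tools"]) / len(outputs)
    # Spans below a chosen agreement threshold (here 2/3, an assumption)
    # would be sent to crowd workers for validation or extension.
    needs_crowd = agreement < 2 / 3
    print(span, sorted(info["types"]), f"agreement={agreement:.2f}", needs_crowd)
```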
