Iterative Refinement and Quality Checking of Annotation Guidelines - How to Deal Effectively with Semantically Sloppy Named Entity Types, such as Pathological Phenomena

We discuss a methodology for annotating semantically hard-to-delineate, i.e., sloppy, named entity types. To illustrate the sloppiness of entities, we treat an example from the medical domain, namely pathological phenomena. Based on our experience with iterative guideline refinement, we propose to carefully characterize the thematic scope of the annotation by positive and negative coding lists and to allow for alternative short vs. long mention span annotations. Short spans account for canonical entity mentions (e.g., standardized disease names), while long spans cover descriptive text snippets which contain entity-specific elaborations (e.g., anatomical locations, observational details, etc.). Using this stratified approach, evidence for increasing annotation performance is provided by kappa-based inter-annotator agreement measurements over several iterative annotation rounds with continuously refined guidelines. These refinements reflect the increasing understanding of the sloppy entity class from the perspective of both guideline writers and users (annotators). Given our data, we have gathered evidence that sloppiness can be handled in a controlled manner, and we expect inter-annotator agreement values around 80% for PathoJen, the pathological phenomena corpus currently under development in our lab.
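The kappa-based agreement measurement mentioned above can be sketched as follows. This is a minimal illustration of Cohen's kappa computed over two annotators' token-level labels; the label names (`PATHO`, `O`) and the toy label sequences are hypothetical and not taken from the PathoJen corpus.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators' label sequences."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # observed agreement: fraction of tokens with identical labels
    po = sum(x == y for x, y in zip(a, b)) / n
    # expected agreement under independence of the two annotators
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[label] / n) * (cb[label] / n) for label in set(ca) | set(cb))
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

# hypothetical per-token decisions: pathological phenomenon vs. outside
ann1 = ["PATHO", "PATHO", "O", "O"]
ann2 = ["PATHO", "O", "O", "O"]
print(cohens_kappa(ann1, ann2))  # 0.5
```

Tracking this value over successive annotation rounds, as the guidelines are refined, is what yields the agreement trend the abstract reports.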
