Generation of Silver Standard Concept Annotations from Biomedical Texts with Special Relevance to Phenotypes

Electronic health records and scientific articles possess differing linguistic characteristics that may impact the performance of natural language processing tools developed for one or the other. In this paper, we investigate the performance of four extant concept recognition tools: the clinical Text Analysis and Knowledge Extraction System (cTAKES), the National Center for Biomedical Ontology (NCBO) Annotator, the Biomedical Concept Annotation System (BeCAS) and MetaMap. Each of the four concept recognition systems is applied to four different corpora: the i2b2 corpus of clinical documents, a PubMed corpus of Medline abstracts, a clinical trials corpus and the ShARe/CLEF corpus. In addition, we assess the individual system performances with respect to one gold standard annotation set, available for the ShARe/CLEF corpus. Furthermore, we build a silver standard annotation set from the individual systems’ output and assess its quality as well as the contribution of individual systems to that quality. Our results demonstrate that the NCBO Annotator and cTAKES are the main contributors to the silver standard corpora (F1-measures in the range of 21% to 74%) and to their quality (best F1-measure of 33%), independent of the type of text investigated. While BeCAS and MetaMap can contribute to the precision of silver standard annotations (precision of up to 42%), the F1-measure drops when they are combined with the NCBO Annotator and cTAKES, due to low recall. In conclusion, the performance of the individual systems needs to be improved independently of the text type, and the strategies for leveraging the individual systems’ annotations need to be revised. The textual content of the PubMed corpus, accession numbers for the clinical trials corpus, and assigned annotations of the four concept recognition systems as well as the generated silver standard annotation sets are available from http://purl.org/phenotype/resources.
The textual content of the ShARe/CLEF (https://sites.google.com/site/shareclefehealth/data) and i2b2 (https://i2b2.org/NLP/DataSets/) corpora must be requested from the respective corpus providers.
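
To make the evaluation setup concrete, the following is a minimal sketch of how a silver standard could be derived from several systems' annotations and scored against a gold standard. It is not the procedure used in the paper: the abstract does not specify the harmonisation strategy, so the simple voting threshold (`min_votes`), the exact-match comparison of spans, the annotation format (start offset, end offset, concept identifier), and all function names and toy identifiers below are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# Hypothetical annotation format: (start offset, end offset, concept id).
Annotation = Tuple[int, int, str]


def build_silver_standard(
    system_outputs: Dict[str, Set[Annotation]], min_votes: int = 2
) -> Set[Annotation]:
    """Keep annotations proposed by at least `min_votes` systems.

    Assumption: agreement is decided by exact match on span and concept id.
    """
    votes: Counter = Counter()
    for annotations in system_outputs.values():
        # Each system contributes at most one vote per annotation (set input).
        votes.update(annotations)
    return {ann for ann, count in votes.items() if count >= min_votes}


def precision_recall_f1(
    predicted: Set[Annotation], reference: Set[Annotation]
) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 of `predicted` against `reference`."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy data with made-up system names and concept ids (illustrative only).
    outputs = {
        "system_a": {(0, 5, "C1"), (10, 15, "C2")},
        "system_b": {(0, 5, "C1")},
        "system_c": {(10, 15, "C2"), (20, 25, "C3")},
    }
    silver = build_silver_standard(outputs, min_votes=2)
    gold = {(0, 5, "C1"), (30, 35, "C4")}
    print(sorted(silver))
    print(precision_recall_f1(silver, gold))
```

Raising `min_votes` in such a scheme typically favours precision over recall, which is one way to read the trade-off reported above when low-recall systems are combined with higher-recall ones.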
