Semantic annotation in biomedicine: the current landscape

The abundance and unstructured nature of biomedical texts, whether clinical or research content, pose significant challenges for the effective and efficient use of the information and knowledge stored in such texts. Annotating biomedical documents with machine-intelligible semantics facilitates advanced, semantics-based text management, curation, indexing, and search. This paper focuses on the annotation of biomedical entity mentions with concepts from relevant biomedical knowledge bases such as UMLS. As a result of such annotation, the meaning of those mentions is unambiguously and explicitly defined, and thus made readily available for automated processing. This process is widely known as semantic annotation, and the tools that perform it are known as semantic annotators. Over the last dozen years, the biomedical research community has invested significant effort in the development of biomedical semantic annotation technology. Aiming to establish grounds for further developments in this area, we review a selected set of state-of-the-art biomedical semantic annotators, focusing particularly on general-purpose annotators, that is, semantic annotation tools that can be customized to work with texts from any area of biomedicine. We also examine potential directions for improving today's annotators so that they can better meet the needs of real-world applications. To motivate and encourage further developments along the suggested and related directions, we review existing and potential practical applications and benefits of semantic annotators.
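As an illustration of what semantic annotation produces, the following is a minimal sketch of a dictionary-lookup annotator. It is not the method of any tool reviewed in the paper: the lexicon, the `TOY:` concept identifiers, and the names `LEXICON`, `Annotation`, and `annotate` are all hypothetical, and a real annotator would draw its terms and identifiers from a knowledge base such as UMLS and would also have to disambiguate mentions.

```python
import re
from dataclasses import dataclass

# Toy lexicon (assumption): surface form -> (concept ID, preferred name).
# The "TOY:" identifiers are made up and are NOT real UMLS identifiers.
LEXICON = {
    "heart attack": ("TOY:0001", "Myocardial infarction"),
    "myocardial infarction": ("TOY:0001", "Myocardial infarction"),
    "aspirin": ("TOY:0002", "Aspirin"),
}


@dataclass(frozen=True)
class Annotation:
    start: int           # character offset where the mention begins
    end: int             # character offset just past the mention
    mention: str         # the text exactly as it appears in the document
    concept_id: str      # identifier of the linked concept
    preferred_name: str  # canonical name of the linked concept


def annotate(text):
    """Link every lexicon term found in `text` to its concept."""
    # Try longer terms first so that multi-word terms win over shorter ones.
    terms = sorted(LEXICON, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    annotations = []
    for match in pattern.finditer(text):
        concept_id, name = LEXICON[match.group(1).lower()]
        annotations.append(
            Annotation(match.start(), match.end(), match.group(0), concept_id, name)
        )
    return annotations


if __name__ == "__main__":
    for ann in annotate("Aspirin after a heart attack."):
        print(ann)
```

Two different surface forms ("heart attack" and "myocardial infarction") resolve to the same concept identifier here, which is the sense in which annotation makes the meaning of a mention explicit and available for automated processing.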
