A survey on annotation tools for the biomedical literature

New approaches to biomedical text mining crucially depend on the existence of comprehensive annotated corpora. Such corpora, commonly called gold standards, are important for learning patterns or models during the training phase, for evaluating and comparing the performance of algorithms, and for better understanding the information sought by means of examples. Gold standards depend on human understanding and manual annotation of natural language text. This process is very time-consuming and expensive because it requires high intellectual effort from domain experts. Accordingly, the lack of gold standards is considered one of the main bottlenecks for developing novel text mining methods. This situation led to the development of tools that support humans in annotating texts. Such tools should be intuitive to use, should support a range of different input formats, should include visualization of annotated texts and should generate an easy-to-parse output format. Today, a range of tools that implement some of these functionalities is available. In this article, we present a comprehensive survey of tools for supporting annotation of biomedical texts. Altogether, we considered almost 30 tools, 13 of which were selected for an in-depth comparison. The comparison was performed using predefined criteria and was accompanied by hands-on experience whenever possible. Our survey shows that current tools can support many of the tasks in biomedical text annotation in a satisfying manner, but also that no tool can be considered a truly comprehensive solution.
