MedTAG: a portable and customizable annotation tool for biomedical documents

Background: Semantic annotators and Natural Language Processing (NLP) methods for Named Entity Recognition and Linking (NER+L) require plenty of training and test data, especially in the biomedical domain. Despite the abundance of unstructured biomedical data, the lack of richly annotated biomedical datasets hinders the further development of NER+L algorithms for effective secondary use. In addition, manual annotation of biomedical documents by physicians and experts is costly and time-consuming. To support, organize, and speed up the annotation process, we introduce MedTAG, a collaborative biomedical annotation tool that is open-source, platform-independent, and free to use and distribute.

Results: We present the main features of MedTAG and describe how physicians and experts have used it in the histopathology domain to manually annotate more than seven thousand clinical reports. We compare the pros and cons of MedTAG with those of a set of well-established biomedical annotation tools, including BioQRator, ezTag, MyMiner, and tagtog. We highlight that MedTAG is one of the very few open-source tools that come with an open license and a straightforward installation procedure supporting cross-platform use.

Conclusions: MedTAG has been designed according to five requirements (i.e., available, distributable, installable, workable, and schematic) defined in a recent extensive review of manual annotation tools. Moreover, MedTAG satisfies 20 of the 22 criteria specified in the same study.
