TeamTat: a collaborative text annotation tool

Manually annotated data is key to developing text-mining and information-extraction algorithms. However, human annotation requires considerable time, effort and expertise. Given the rapid growth of biomedical literature, it is paramount to build tools that speed up annotation while maintaining expert quality. While existing text annotation tools may provide user-friendly interfaces to domain experts, they offer limited support for figure display, project management, and multi-user team annotation. In response, we developed TeamTat (https://www.teamtat.org), a web-based annotation tool (local setup available), equipped to manage team annotation projects engagingly and efficiently. TeamTat is a novel tool for managing multi-user, multi-label document annotation, reflecting the entire production life cycle. Project managers can specify the annotation schema for entities and relations, select annotators, and distribute documents anonymously to prevent bias. Document input format can be plain text, PDF or BioC (uploaded locally or automatically retrieved from PubMed/PMC), and output format is BioC with inline annotations. TeamTat displays figures from the full text for the annotator's convenience. Multiple users can work on the same document independently in their own workspaces, and the team manager can track task completion. TeamTat provides corpus quality assessment via inter-annotator agreement statistics, and a user-friendly interface for reviewing annotations and resolving inter-annotator disagreements to improve corpus quality.
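To make the idea of inter-annotator agreement concrete, the sketch below computes a pairwise agreement score between annotators over the same document. It is an illustration only: the abstract does not say which statistics TeamTat reports, so the exact-match criterion (same start offset, end offset and entity type), the score 2|A∩B| / (|A| + |B|), and the function names `pairwise_agreement` and `agreement_table` are all assumptions made for this example.

```python
from itertools import combinations


def pairwise_agreement(a, b):
    """Exact-match agreement between two annotation sets.

    Each annotation is a (start, end, entity_type) tuple. The score is
    2 * |A ∩ B| / (|A| + |B|), which equals the F1 score obtained when
    either annotator is treated as the reference. This metric is an
    assumption for illustration, not TeamTat's documented statistic.
    """
    a, b = set(a), set(b)
    if not a and not b:
        # Two empty annotation sets agree trivially.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def agreement_table(annotations_by_user):
    """Return the agreement score for every unordered pair of annotators."""
    return {
        (u, v): pairwise_agreement(annotations_by_user[u], annotations_by_user[v])
        for u, v in combinations(sorted(annotations_by_user), 2)
    }


# Hypothetical annotations of one document by two annotators.
annotations = {
    "annotator_a": [(0, 5, "Gene"), (10, 18, "Disease"), (25, 30, "Gene")],
    "annotator_b": [(0, 5, "Gene"), (10, 18, "Chemical"), (25, 30, "Gene")],
}

for pair, score in agreement_table(annotations).items():
    print(pair, round(score, 3))
```

In this example the two annotators agree on two of three spans and disagree on the type of the third, so the pair scores 2/3. A disagreement of this kind is what a review interface would surface for resolution.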
