Linked annotations: a middle ground for manual curation of biomedical databases and text corpora

Summary

Annotators of text corpora and of biomedical databases carry out the same labor-intensive task: manually extracting structured data from unstructured text. Tasks are needlessly repeated because text corpora are widely scattered. We envision that a linked annotation resource unifying many corpora could be a game changer. Such an open forum would help annotators focus on novel annotations and make the best use of the effort of many experts. As a proof of concept, we annotated protein subcellular localization in 100 abstracts cited by UniProtKB. A detailed comparison between our new corpus and the original UniProtKB annotations revealed sustained novel annotations for 42% of the entries (proteins). In a unified linked annotation resource, these could immediately extend the utility of text corpora beyond the text-mining community. Our example motivates the central idea that linked annotations from text corpora can complement database annotations.

Background

The natural language processing (NLP) and biomedical research communities both invest great effort in producing high-quality manual annotations of the biomedical literature. The focus and the annotation strategies of the two communities have, however, differed so much that collaborations have remained strikingly limited. Most text corpora contain detailed markup of only a few types of entities and relationships