MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts

This paper presents the formal release of {\em MedMentions}, a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines. In addition to the full corpus, a sub-corpus of MedMentions is also presented, comprising annotations for a subset of UMLS 2017 targeted towards document retrieval. To encourage research in Biomedical Named Entity Recognition and Linking, data splits for training and testing are included in the release, and a baseline model and its metrics for entity linking are also described.

[1]  Andrew McCallum,et al.  Hierarchical Losses and New Resources for Fine-grained Entity Typing and Linking , 2018, ACL.

[2]  Sampo Pyysalo,et al.  Overview of the Cancer Genetics and Pathway Curation tasks of BioNLP Shared Task 2013 , 2015, BMC Bioinformatics.

[3]  Zhiyong Lu,et al.  BioCreative V CDR task corpus: a resource for chemical disease relation extraction , 2016, Database J. Biol. Databases Curation.

[4]  Zhiyong Lu,et al.  TaggerOne: joint named entity recognition and normalization with semi-Markov Models , 2016, Bioinform..

[5]  Nigel Collier,et al.  Introduction to the Bio-entity Recognition Task at JNLPBA , 2004, NLPBA/BioNLP.

[6]  Zhiyong Lu,et al.  BC4GO: a full-text corpus for the BioCreative IV GO task , 2014, Database J. Biol. Databases Curation.

[7]  Tom M. Mitchell,et al.  Zero-shot Learning of Classifiers from Natural Language Quantification , 2018, ACL.

[8]  Claire Grover,et al.  The ITI TXM Corpora: Tissue Expressions and Protein-Protein Interactions , 2008 .

[9]  Christoph H. Lampert,et al.  Zero-Shot Learning—A Comprehensive Evaluation of the Good, the Bad and the Ugly , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Ioannis A. Kakadiaris,et al.  Results of the sixth edition of the BioASQ Challenge , 2018 .

[11]  K. Bretonnel Cohen,et al.  Concept annotation in the CRAFT corpus , 2012, BMC Bioinformatics.

[12]  Andrew McCallum,et al.  Marginal Likelihood Training of BiLSTM-CRF for Biomedical Named Entity Recognition from Disjoint Label Sets , 2018, EMNLP.

[13]  Geoffrey E. Hinton,et al.  Zero-shot Learning with Semantic Output Codes , 2009, NIPS.

[14]  Olivier Bodenreider,et al.  The Unified Medical Language System (UMLS): integrating biomedical terminology , 2004, Nucleic Acids Res..

[15]  Erik F. Tjong Kim Sang,et al.  Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition , 2003, CoNLL.

[16]  Mihai Surdeanu,et al.  The Stanford CoreNLP Natural Language Processing Toolkit , 2014, ACL.

[17]  Jin-Dong Kim,et al.  The GENIA corpus: an annotated research abstract corpus in molecular biology domain , 2002 .

[18]  Sampo Pyysalo,et al.  A neural network multi-task learning approach to biomedical named entity recognition , 2017, BMC Bioinformatics.

[19]  Zhiyong Lu,et al.  An improved corpus of disease mentions in PubMed citations , 2012, BioNLP@HLT-NAACL.

[20]  Jun'ichi Tsujii,et al.  GENIA corpus - a semantically annotated corpus for bio-textmining , 2003, ISMB.

[21]  Zhiyong Lu,et al.  PubTator: a web-based text mining tool for assisting biocuration , 2013, Nucleic Acids Res..