Recognizing UMLS Semantic Types with Deep Learning

Entity recognition is a critical first step in a number of clinical NLP applications, such as entity linking and relation extraction. We present the first attempt to apply state-of-the-art entity recognition approaches to a newly released dataset, MedMentions. This dataset contains over 4000 biomedical abstracts annotated for UMLS semantic types. Compared with existing datasets, MedMentions contains a far greater number of entity types, and thus represents a more challenging but more realistic scenario. We explore a number of relevant dimensions, including contextual versus non-contextual word embeddings, general versus domain-specific unsupervised pre-training, and different deep learning architectures. We contrast our results against the well-known i2b2 2010 entity recognition dataset, and propose a new method to combine general and domain-specific information. While our approach produces a state-of-the-art result on the i2b2 2010 task (F1 = 0.90), our results on MedMentions are considerably lower (F1 = 0.63), suggesting there is still plenty of room for improvement on this new data.
