Using Distributed Representations to Disambiguate Biomedical and Clinical Concepts

In this paper, we report a knowledge-based method for Word Sense Disambiguation in the domains of biomedical and clinical text. We combine word representations created on large corpora with a small number of definitions from the UMLS to create concept representations, which we then compare to representations of the context of ambiguous terms. Using no relational information, we obtain comparable performance to previous approaches on the MSH-WSD dataset, which is a well-known dataset in the biomedical domain. Additionally, our method is fast and easy to set up and extend to other domains. Supplementary materials, including source code, can be found at https: //github.com/clips/yarn

[1]  Michael E. Lesk,et al.  Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone , 1986, SIGDOC '86.

[2]  Bridget T. McInnes,et al.  Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation , 2011, BMC Bioinformatics.

[3]  KilicogluHalil,et al.  Word sense disambiguation by selecting the best semantic type based on Journal Descriptor Indexing , 2006 .

[4]  Ted Pedersen,et al.  Evaluating measures of semantic similarity and relatedness to disambiguate terms in biomedical text , 2013, J. Biomed. Informatics.

[5]  Rafael Berlanga,et al.  Knowledge based word-concept model estimation and refinement for biomedical text mining. , 2015, Journal of biomedical informatics.

[6]  Olivier Bodenreider,et al.  The Unified Medical Language System (UMLS): integrating biomedical terminology , 2004, Nucleic Acids Res..

[7]  Halil Kilicoglu,et al.  Word sense disambiguation by selecting the best semantic type based on Journal Descriptor Indexing: Preliminary experiment , 2006, J. Assoc. Inf. Sci. Technol..

[8]  Ted Pedersen,et al.  UMLS-Interface and UMLS-Similarity : Open Source Software for Measuring Paths and Semantic Similarity , 2009, AMIA.

[9]  Peter Szolovits,et al.  MIMIC-III, a freely accessible critical care database , 2016, Scientific Data.

[10]  Hal Daumé,et al.  Deep Unordered Composition Rivals Syntactic Methods for Text Classification , 2015, ACL.

[11]  Bridget T. McInnes An Unsupervised Vector Approach to Biomedical Term Disambiguation: Integrating UMLS and Medline , 2008, ACL.

[12]  Ted Pedersen,et al.  WordNet::Similarity - Measuring the Relatedness of Concepts , 2004, NAACL.

[13]  Nancy Ide,et al.  Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art , 1998, Comput. Linguistics.

[14]  Walter Daelemans,et al.  Pattern for Python , 2012, J. Mach. Learn. Res..

[15]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[16]  Zhiyuan Liu,et al.  A Unified Model for Word Sense Representation and Disambiguation , 2014, EMNLP.

[17]  Petr Sojka,et al.  Software Framework for Topic Modelling with Large Corpora , 2010 .

[18]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[19]  Christopher Potts,et al.  Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank , 2013, EMNLP.

[20]  Fabien L. Gandon,et al.  Biomedical word sense disambiguation with ontologies and metadata: automation meets accuracy , 2009, BMC Bioinformatics.

[21]  Mirella Lapata,et al.  Vector-based Models of Semantic Composition , 2008, ACL.