Evaluating measures of semantic similarity and relatedness to disambiguate terms in biomedical text

INTRODUCTION In this article, we evaluate a knowledge-based word sense disambiguation (WSD) method that uses semantic similarity and relatedness measures to determine the intended concept associated with an ambiguous word in biomedical text. These measures quantify the degree of similarity or relatedness between concepts in the Unified Medical Language System (UMLS). The objective of this work is to develop a method that can disambiguate terms in biomedical text by exploiting similarity and relatedness information extracted from biomedical resources, and to evaluate the efficacy of these measures for WSD. METHOD We evaluate our method on a biomedical dataset (MSH-WSD) that contains 203 ambiguous terms and acronyms. RESULTS We show that information content-based measures, whether derived from a corpus or from a taxonomy, obtain higher disambiguation accuracy on the MSH-WSD dataset than path-based measures or relatedness measures. AVAILABILITY The WSD system is open source and freely available from http://search.cpan.org/dist/UMLS-SenseRelate/. The MSH-WSD dataset is available from the National Library of Medicine at http://wsd.nlm.nih.gov.
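To make the idea of disambiguating with an information content-based similarity measure concrete, the following is a minimal sketch in Python. It is not the UMLS-SenseRelate implementation: the taxonomy, the concept names, and the frequency counts are all hypothetical toy values rather than UMLS data, and the scoring rule (sum the similarity of each candidate sense to the surrounding context concepts, then pick the highest) is an assumption about how such a method can be set up. The similarity used here is Lin's measure, computed from corpus-style information content.

```python
import math

# Hypothetical toy taxonomy (child -> parent); these are NOT real UMLS concepts.
PARENT = {
    "disease": "root",
    "phenomenon": "root",
    "common_cold": "disease",
    "influenza": "disease",
    "fever": "disease",
    "cold_temperature": "phenomenon",
    "heat": "phenomenon",
}

# Hypothetical corpus frequencies for the leaf concepts (illustrative only).
LEAF_COUNTS = {
    "common_cold": 10,
    "influenza": 20,
    "fever": 30,
    "cold_temperature": 25,
    "heat": 15,
}


def ancestors(concept):
    """Return the path from a concept up to the root, concept first."""
    path = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        path.append(concept)
    return path


def propagated_counts():
    """Add each leaf count to the leaf itself and to all of its ancestors."""
    totals = {}
    for leaf, count in LEAF_COUNTS.items():
        for node in ancestors(leaf):
            totals[node] = totals.get(node, 0) + count
    return totals


TOTALS = propagated_counts()


def ic(concept):
    """Information content: negative log of the concept's relative frequency."""
    return -math.log(TOTALS[concept] / TOTALS["root"])


def lcs(a, b):
    """Least common subsumer: the deepest ancestor shared by a and b."""
    ancestors_of_b = set(ancestors(b))
    for node in ancestors(a):
        if node in ancestors_of_b:
            return node
    raise ValueError("concepts do not share a root")


def lin(a, b):
    """Lin similarity: 2 * IC(lcs) / (IC(a) + IC(b))."""
    denominator = ic(a) + ic(b)
    return 2 * ic(lcs(a, b)) / denominator if denominator > 0 else 0.0


def disambiguate(senses, context):
    """Pick the candidate sense with the highest summed similarity to the context."""
    return max(senses, key=lambda s: sum(lin(s, c) for c in context))


if __name__ == "__main__":
    senses_of_cold = ["common_cold", "cold_temperature"]
    context_concepts = ["influenza", "fever"]
    print(disambiguate(senses_of_cold, context_concepts))
```

In this toy setting the two senses of "cold" are scored against the context concepts "influenza" and "fever". The disease sense shares the subsumer "disease" with both context concepts, while the temperature sense shares only the root, whose information content is zero, so the disease sense is selected. A path-based measure would replace `lin` with a function of the path length between the two concepts in the taxonomy.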
