Determining the difficulty of Word Sense Disambiguation

Automatic processing of biomedical documents is made difficult by the fact that many of the terms they contain are ambiguous. Word Sense Disambiguation (WSD) systems attempt to resolve these ambiguities and identify the correct meaning. However, the published literature on WSD systems for biomedical documents reports considerable differences in performance across terms. Developing WSD systems is often expensive because of the cost of acquiring the necessary training data. It would therefore be useful to predict in advance which terms WSD systems are likely to handle well or badly. This paper explores various methods for estimating the performance of WSD systems on a wide range of ambiguous biomedical terms (including ambiguous words/phrases and abbreviations). The methods include both supervised and unsupervised approaches: the supervised approaches use information from labeled training data, while the unsupervised ones rely on the UMLS Metathesaurus. The approaches are evaluated by comparing their predictions of how difficult each ambiguous term will be to disambiguate against the output of two WSD systems. We find that the supervised methods are the best predictors of WSD difficulty but are limited by their dependence on labeled training data. The unsupervised methods all perform well in some situations and can be applied more widely.
