Paper information - Using Symbolic Knowledge in the UMLS to Disambiguate Words in Small Datasets with a Naïve Bayes Classifier

Using Symbolic Knowledge in the UMLS to Disambiguate Words in Small Datasets with a Naïve Bayes Classifier

Current approaches to word sense disambiguation use and combine various machine-learning techniques. Most rely on characteristics of the ambiguous word and its surrounding words and are trained on hundreds of examples. Unfortunately, developing large training sets is time-consuming. We investigate the use of symbolic knowledge to augment machine-learning techniques for small datasets. The knowledge base consists of the UMLS semantic types assigned to concepts found in the sentence and the relationships between these semantic types. A naïve Bayes classifier was trained for 15 words with 100 examples each. The most frequent sense of a word served as the baseline. The effect of increasingly accurate symbolic knowledge was evaluated in eight experimental conditions. Performance was measured as accuracy under 10-fold cross-validation. The best condition used only the semantic types of the words in the sentence. In that condition, accuracy was on average 10% higher than the baseline; across words, however, it ranged from an 8% deterioration to a 29% improvement. In a follow-up evaluation, we noted a trend: disambiguation was best for the words that were least troublesome to the human evaluators.
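The setup in the abstract (a naïve Bayes classifier over semantic types, a most-frequent-sense baseline, and cross-validated accuracy) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a Bernoulli naïve Bayes with add-one smoothing, uses 5 folds instead of 10 because the toy dataset is tiny, and every example in `DATA` is invented for illustration. The feature strings are meant to resemble UMLS semantic type abbreviations, and the function names `train_nb`, `predict_nb`, and `cross_validate` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Invented toy data for one ambiguous word: each example pairs the semantic
# types found in a sentence with the annotated sense. Not from the paper.
DATA = [
    (["dsyn", "sosy"], "illness"),
    (["npop"], "temperature"),
    (["sosy", "phsu"], "illness"),
    (["dsyn", "phsu"], "illness"),
    (["npop", "bpoc"], "temperature"),
    (["sosy"], "illness"),
    (["dsyn"], "illness"),
    (["npop", "bpoc"], "temperature"),
    (["dsyn", "sosy", "phsu"], "illness"),
    (["sosy", "phsu"], "illness"),
]

def train_nb(examples):
    """Count senses and, per sense, the examples containing each feature."""
    sense_counts = Counter(sense for _, sense in examples)
    feature_counts = defaultdict(Counter)
    vocab = set()
    for feats, sense in examples:
        for f in set(feats):
            feature_counts[sense][f] += 1
            vocab.add(f)
    return sense_counts, feature_counts, sorted(vocab)

def predict_nb(model, feats):
    """Return the sense with the highest log posterior (Bernoulli model)."""
    sense_counts, feature_counts, vocab = model
    total = sum(sense_counts.values())
    present = set(feats)
    best_sense, best_score = None, -math.inf
    for sense in sorted(sense_counts):
        n = sense_counts[sense]
        score = math.log(n / total)  # log prior of the sense
        for f in vocab:
            # Add-one smoothed probability that feature f occurs with this sense.
            p = (feature_counts[sense][f] + 1) / (n + 2)
            score += math.log(p if f in present else 1.0 - p)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

def cross_validate(data, k):
    """Return (naive Bayes accuracy, most-frequent-sense baseline accuracy)."""
    nb_correct = base_correct = 0
    for fold in range(k):
        train = [ex for i, ex in enumerate(data) if i % k != fold]
        test = [ex for i, ex in enumerate(data) if i % k == fold]
        model = train_nb(train)
        # Baseline: always predict the most frequent sense in the training fold.
        majority = model[0].most_common(1)[0][0]
        for feats, sense in test:
            nb_correct += predict_nb(model, feats) == sense
            base_correct += majority == sense
    return nb_correct / len(data), base_correct / len(data)

nb_accuracy, baseline_accuracy = cross_validate(DATA, k=5)
print(f"naive Bayes: {nb_accuracy:.2f}, baseline: {baseline_accuracy:.2f}")
```

Because the toy features separate the two senses cleanly, the classifier beats the baseline easily here. The abstract reports a much more mixed picture on real data, with per-word changes ranging from deterioration to improvement.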

Thomas C. Rindflesch | Gondy Leroy

[1] Ted Pedersen, et al. A Decision Tree of Bigrams is an Accurate Predictor of Word Sense, 2001, NAACL.

[2] David Yarowsky, et al. Combining Classifiers for word sense disambiguation, 2002, Nat. Lang. Eng.

[3] Raymond J. Mooney, et al. Comparative Experiments on Disambiguating Word Senses: An Illustration of the Role of Bias in Machine Learning, 1996, EMNLP.

[4] Hongfang Liu, et al. Disambiguating Ambiguous Biomedical Terms in Biomedical Narrative Text: An Unsupervised Method, 2001, J. Biomed. Informatics.

[5] Vasileios Hatzivassiloglou, et al. Disambiguating proteins, genes, and RNA in text: a machine learning approach, 2001, ISMB.

[6] Ian H. Witten, et al. Data mining: practical machine learning tools and techniques, 3rd Edition, 1999.

[7] Marc Weeber, et al. Developing a test collection for biomedical word sense disambiguation, 2001, AMIA.

[8] Betsy L. Humphreys, et al. Technical Milestone: The Unified Medical Language System: An Informatics Research Collaboration, 1998, J. Am. Medical Informatics Assoc.

[9] Naomi C. Broering, et al. High performance medical libraries: Advances in information management for the virtual era, 1993.

[10] Nancy Ide, et al. Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art, 1998, Comput. Linguistics.

[11] Alexa T. McCray, et al. Representing biomedical knowledge in the UMLS semantic network, 1993.

[12] George A. Miller, et al. Introduction to WordNet: An On-line Lexical Database, 1990.

[13] Graeme Hirst, et al. Automatic Sense Disambiguation of the Near-Synonyms in a Dictionary Entry, 2003, CICLing.

[14] Alan R. Aronson, et al. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program, 2001, AMIA.

[15] Hongfang Liu, et al. Research Paper: Automatic Resolution of Ambiguous Terms Based on Machine Learning and Conceptual Relations in the UMLS, 2002, J. Am. Medical Informatics Assoc.