Word sense disambiguation by selecting the best semantic type based on Journal Descriptor Indexing: Preliminary experiment

An experiment in word sense disambiguation (WSD) using the Journal Descriptor Indexing (JDI) methodology was performed at the National Library of Medicine® (NLM®). The motivation is the ambiguity problem confronting NLM's MetaMap system, which maps free text to terms corresponding to concepts in NLM's Unified Medical Language System® (UMLS®) Metathesaurus®. If the text maps to more than one Metathesaurus concept with the same high confidence score, MetaMap has no way of knowing which concept is the correct mapping. We describe the JDI methodology, which is ultimately based on statistical associations between words in a training set of MEDLINE® citations and a small set of journal descriptors (assigned by humans to journals as a whole) that the citations are assumed to inherit. JDI is the basis for selecting the best meaning of an ambiguous string, using the UMLS semantic types (STs) assigned to its candidate concepts in the Metathesaurus. For example, the ambiguous string "transport" has two meanings: "Biological Transport", assigned the ST Cell Function, and "Patient transport", assigned the ST Health Care Activity. A JDI-based method can analyze text containing "transport" and determine which ST receives the higher score for that text; the meaning associated with that ST is then returned and presumed to apply to the ambiguous string itself. We then present an experiment in which a baseline disambiguation method was compared to four versions of JDI in disambiguating 45 ambiguous strings from NLM's WSD Test Collection. Overall average precision for the highest-scoring JDI version was 0.7873, compared to 0.2492 for the baseline method. Average precision for individual ambiguities was greater than 0.90 for 23 of them (51%), greater than 0.85 for 24 (53%), and greater than 0.65 for 35 (79%). On the basis of these results, we hope to improve the performance of JDI and test its use in applications.
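The selection step described above can be illustrated with a minimal sketch. It is not the NLM implementation: the training citations, the mapping from semantic types to journal descriptors (`ST_TO_JDS`), the averaging formula, and all function names are assumptions made for this toy example. The sketch learns word-to-descriptor associations from citations that inherit their journal's descriptors, scores each candidate ST for a piece of text, and returns the meaning tied to the higher-scoring ST.

```python
from collections import Counter, defaultdict

# Toy training set (invented): words of a citation plus the journal
# descriptors (JDs) the citation is assumed to inherit from its journal.
TRAINING = [
    ("membrane ion transport channel".split(), ["Cytology"]),
    ("protein transport across cell membrane".split(), ["Cytology"]),
    ("ambulance transport of trauma patient".split(), ["Emergency Medicine"]),
    ("patient transport between hospital units".split(), ["Emergency Medicine"]),
]

# Hypothetical link between each candidate semantic type (ST) and JDs.
ST_TO_JDS = {
    "Cell Function": ["Cytology"],
    "Health Care Activity": ["Emergency Medicine"],
}

# The two meanings of "transport" and their STs, as given in the abstract.
SENSES = {
    "Cell Function": "Biological Transport",
    "Health Care Activity": "Patient transport",
}


def train_word_jd(training):
    """For each word, the fraction of citations containing it that carry each JD."""
    word_count = Counter()
    word_jd_count = defaultdict(Counter)
    for words, jds in training:
        for word in set(words):
            word_count[word] += 1
            for jd in jds:
                word_jd_count[word][jd] += 1
    return {
        word: {jd: n / word_count[word] for jd, n in word_jd_count[word].items()}
        for word in word_count
    }


def score_sts(text, word_jd, st_to_jds):
    """Score each ST as the mean word-JD association over known words in the text."""
    words = [w for w in text.lower().split() if w in word_jd]
    scores = {}
    for st, jds in st_to_jds.items():
        if not words:
            scores[st] = 0.0
            continue
        total = sum(word_jd[w].get(jd, 0.0) for w in words for jd in jds)
        scores[st] = total / (len(words) * len(jds))
    return scores


def disambiguate(text, word_jd, st_to_jds, senses):
    """Return the meaning attached to the highest-scoring ST, plus all ST scores."""
    scores = score_sts(text, word_jd, st_to_jds)
    best_st = max(scores, key=scores.get)  # ties resolve to the first ST listed
    return senses[best_st], scores


if __name__ == "__main__":
    word_jd = train_word_jd(TRAINING)
    for context in ("transport of the patient by ambulance",
                    "ion transport across the membrane"):
        sense, scores = disambiguate(context, word_jd, ST_TO_JDS, SENSES)
        print(context, "->", sense, scores)
```

In this toy data the word "transport" itself is equally associated with both descriptors, so the decision is driven entirely by the surrounding words, which mirrors the idea that the ST score for the whole text is presumed to apply to the ambiguous string.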
