Supervised and knowledge-based methods for disambiguating terms in biomedical text using the UMLS and MetaMap

Word sense disambiguation is the task of automatically identifying the appropriate sense (or concept) of an ambiguous word. For example, the term "cold" could refer to a temperature or a virus, depending on the context in which it is used. Failing to identify the intended concept of an ambiguous word reduces the accuracy of biomedical applications such as medical coding and indexing, which are becoming essential in the biomedical and clinical world with the push towards electronic medical records and the growing amount of information available to biomedical researchers and clinicians. This dissertation focuses on disambiguating ambiguous words in biomedical text. It presents two methods, K-CUI and A-CUI, that can disambiguate ambiguous terms in any biomedical text using information from the Unified Medical Language System (UMLS). K-CUI explores the use of Concept Unique Identifiers (CUIs), as assigned by MetaMap, as features for a supervised learning method for word sense disambiguation. It also investigates four techniques that reduce noise in the feature set by restricting which CUIs are included. The first technique is windowing; its results show that in biomedical text indicative CUIs are highly localized. The second is a frequency cutoff; its results show that when a dataset contains a high majority concept, the features that occur only a few times are essential for disambiguating the minority concepts. The third is a MetaMap Indexing cutoff; its results show that word concepts are correlated with the topical information describing an instance. The fourth is a semantic similarity cutoff; its results show that in biomedical text, indicative features have a high semantic similarity with at least one of the possible concepts of the ambiguous word.
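The first two feature-restriction techniques, windowing and the frequency cutoff, can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names `window_cuis` and `frequency_cutoff`, the choice of counting frequency per instance, and the placeholder CUI strings are all assumptions made for illustration, and the sketch assumes each instance has already been mapped to a sequence of CUIs (for example by MetaMap).

```python
from collections import Counter


def window_cuis(cuis, target_index, window):
    """Return the CUIs within `window` positions on either side of the target.

    `cuis` is the ordered list of CUIs for one instance and `target_index`
    is the position of the ambiguous term (both hypothetical inputs).
    """
    start = max(0, target_index - window)
    end = target_index + window + 1
    return [
        cui
        for i, cui in enumerate(cuis[start:end], start=start)
        if i != target_index
    ]


def frequency_cutoff(instances, min_count):
    """Drop CUIs that appear in fewer than `min_count` instances.

    Frequency is counted once per instance here; this is an assumption,
    not necessarily how the dissertation counts occurrences.
    """
    instance_freq = Counter(cui for inst in instances for cui in set(inst))
    return [
        [cui for cui in inst if instance_freq[cui] >= min_count]
        for inst in instances
    ]


# Placeholder identifiers, not real UMLS CUIs.
instance = ["C0000001", "C0000002", "C0000003", "C0000004", "C0000005"]
nearby = window_cuis(instance, target_index=2, window=1)

training = [
    ["C0000001", "C0000002"],
    ["C0000001", "C0000003"],
    ["C0000001", "C0000002", "C0000004"],
]
filtered = frequency_cutoff(training, min_count=2)
```

The filtered CUI lists would then serve as features for a supervised classifier. Note that, per the results summarized above, a frequency cutoff can discard exactly the rare features that identify minority concepts when one concept dominates the data, so `min_count` would need to be chosen with care.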
