Measures of semantic similarity and relatedness in the biomedical domain

Measures of semantic similarity between concepts are widely used in Natural Language Processing. In this article, we show how six existing domain-independent measures can be adapted to the biomedical domain. These measures were originally based on WordNet, an English lexical database of concepts and relations. In this research, we adapt them to the SNOMED-CT ontology of medical concepts. The measures include two path-based measures and three measures that augment path-based measures with information content statistics from corpora. We also derive a context vector measure based on medical corpora that can be used as a measure of semantic relatedness. These six measures are evaluated against a newly created test bed of 30 medical concept pairs scored by three physicians and nine medical coders. We find that the medical coders and physicians differ in their ratings, and that the context vector measure correlates most closely with the physicians, while the path-based measures and one of the information content measures correlate most closely with the medical coders. We conclude that there is a role both for more flexible measures of relatedness based on information derived from corpora and for measures that rely on existing ontological structures.
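
To make the distinction between path-based and information content measures concrete, the following is a minimal sketch rather than the implementation evaluated in the article. It assumes a tiny hand-made is-a hierarchy with one parent per concept and invented corpus counts. The concept names, the counts, and every function name are hypothetical. SNOMED-CT itself allows multiple parents, which this sketch ignores. The path measure is taken here as the reciprocal of the number of nodes on the shortest is-a path, and the information content measure follows the general idea of scoring a pair by the information content of its least common subsumer.

```python
import math

# Hypothetical toy is-a hierarchy (child -> parent); NOT taken from SNOMED-CT.
PARENT = {
    "disease": None,
    "heart disease": "disease",
    "lung disease": "disease",
    "myocardial infarction": "heart disease",
    "angina": "heart disease",
    "pneumonia": "lung disease",
}

# Hypothetical corpus counts of direct mentions of each concept (invented data).
COUNTS = {
    "disease": 10,
    "heart disease": 20,
    "lung disease": 15,
    "myocardial infarction": 30,
    "angina": 15,
    "pneumonia": 10,
}


def ancestors(concept):
    """Return the chain [concept, parent, ..., root]."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def least_common_subsumer(a, b):
    """Return the deepest concept that subsumes both a and b."""
    b_chain = set(ancestors(b))
    for node in ancestors(a):
        if node in b_chain:
            return node
    raise ValueError("concepts share no ancestor")


def path_similarity(a, b):
    """Reciprocal of the number of nodes on the shortest is-a path."""
    lcs = least_common_subsumer(a, b)
    edges = ancestors(a).index(lcs) + ancestors(b).index(lcs)
    return 1.0 / (edges + 1)


def information_content(concept):
    """-log of the probability of the concept or any of its descendants."""
    total = sum(COUNTS.values())
    subsumed = sum(n for c, n in COUNTS.items() if concept in ancestors(c))
    return -math.log(subsumed / total)


def ic_similarity(a, b):
    """Information content of the least common subsumer of a and b."""
    return information_content(least_common_subsumer(a, b))
```

In this toy hierarchy, "myocardial infarction" and "angina" share the specific parent "heart disease", so both functions score them higher than "myocardial infarction" and "pneumonia", whose only shared ancestor is the root. A context vector measure of relatedness would instead compare co-occurrence vectors built from a corpus and does not depend on the hierarchy at all.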
