An ontology-based similarity measure for biomedical data - Application to radiology reports

BACKGROUND Determining similarity between two individual concepts or two sets of concepts extracted from a free text document is important for various aspects of biomedicine, for instance, to find prior clinical reports for a patient that are relevant to the current clinical context. Using simple concept matching techniques, such as lexicon based comparisons, is typically not sufficient to determine an accurate measure of similarity. METHODS In this study, we tested an enhancement to the standard document vector cosine similarity model in which ontological parent-child (is-a) relationships are exploited. For a given concept, we define a semantic vector consisting of all parent concepts and their corresponding weights as determined by the shortest distance between the concept and parent after accounting for all possible paths. Similarity between the two concepts is then determined by taking the cosine angle between the two corresponding vectors. To test the improvement over the non-semantic document vector cosine similarity model, we measured the similarity between groups of reports arising from similar clinical contexts, including anatomy and imaging procedure. We further applied the similarity metrics within a k-nearest-neighbor (k-NN) algorithm to classify reports based on their anatomical and procedure based groups. 2150 production CT radiology reports (952 abdomen reports and 1128 neuro reports) were used in testing with SNOMED CT, restricted to Body structure, Clinical finding and Procedure branches, as the reference ontology. RESULTS The semantic algorithm preferentially increased the intra-class similarity over the inter-class similarity, with a 0.07 and 0.08 mean increase in the neuro-neuro and abdomen-abdomen pairs versus a 0.04 mean increase in the neuro-abdomen pairs. Using leave-one-out cross-validation in which each document was iteratively used as a test sample while excluding it from the training data, the k-NN based classification accuracy was shown in all cases to be consistently higher with the semantics based measure compared with the non-semantic case. Moreover, the accuracy remained steady even as k value was increased - for the two anatomy related classes accuracy for k=41 was 93.1% with semantics compared to 86.7% without semantics. Similarly, for the eight imaging procedures related classes, accuracy (for k=41) with semantics was 63.8% compared to 60.2% without semantics. At the same k, accuracy improved significantly to 82.8% and 77.4% respectively when procedures were logically grouped together into four classes (such as ignoring contrast information in the imaging procedure description). Similar results were seen at other k-values. CONCLUSIONS The addition of semantic context into the document vector space model improves the ability of the cosine similarity to differentiate between radiology reports of different anatomical and image procedure-based classes. This effect can be leveraged for document classification tasks, which suggests its potential applicability for biomedical information retrieval.

[1]  Younès Bennani,et al.  Semi-structured document categorization with a semantic kernel , 2009, Pattern Recognit..

[2]  Michael Weiner,et al.  XOntoRank: Ontology-Aware Search of Electronic Medical Records , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[3]  Olivier Bodenreider,et al.  Comparing terms, concepts and semantic classes in WordNet and the Unified Medical Language System , 2001 .

[4]  Carlo Strapparava,et al.  Corpus-based and Knowledge-based Measures of Text Semantic Similarity , 2006, AAAI.

[5]  Ted Pedersen,et al.  Measures of semantic similarity and relatedness in the biomedical domain , 2007, J. Biomed. Informatics.

[6]  Christopher G. Chute,et al.  The National Center for Biomedical Ontology , 2012, J. Am. Medical Informatics Assoc..

[7]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[8]  Sule Gündüz Ögüdücü,et al.  A taxonomy based semantic similarity of documents using the cosine measure , 2009, 2009 24th International Symposium on Computer and Information Sciences.

[9]  Mark A. Musen,et al.  Comparison of Ontology-based Semantic-Similarity Measures , 2008, AMIA.

[10]  Jeffrey P. Krischer,et al.  Use of SNOMED CT to represent clinical research data: a semantic characterization of data items on case report forms in vasculitis research. , 2006, Journal of the American Medical Informatics Association : JAMIA.

[11]  George Hripcsak,et al.  Inter-patient distance metrics using SNOMED CT defining relationships , 2006, J. Biomed. Informatics.

[12]  Cynthia Brandt,et al.  Semantic similarity in the biomedical domain: an evaluation across knowledge sources , 2012, BMC Bioinformatics.

[13]  Rada Mihalcea,et al.  Measuring the Semantic Similarity of Texts , 2005, EMSEE@ACL.

[14]  Hoa A. Nguyen,et al.  A Cluster-Based Approach for Semantic Similarity in the Biomedical Domain , 2006, 2006 International Conference of the IEEE Engineering in Medicine and Biology Society.

[15]  Ramin Khorasani,et al.  Leveraging terminologies for retrieval of radiology reports with critical imaging findings. , 2011, AMIA ... Annual Symposium proceedings. AMIA Symposium.

[16]  Jennifer Widom,et al.  Exploiting hierarchical domain structure to compute similarity , 2003, TOIS.

[17]  Christiane Fellbaum,et al.  Combining Local Context and Wordnet Similarity for Word Sense Identification , 1998 .

[18]  Yuval Shahar,et al.  Vaidurya: A multiple-ontology, concept-based, context-sensitive clinical-guideline search engine , 2009, J. Biomed. Informatics.

[19]  Noémie Elhadad,et al.  A hybrid knowledge-based and data-driven approach to identifying semantically similar concepts , 2012, J. Biomed. Informatics.

[20]  David Sánchez,et al.  An ontology-based measure to compute semantic similarity in biomedicine , 2011, J. Biomed. Informatics.

[21]  Ted Briscoe,et al.  32nd Annual Meeting of the Association for Computational Linguistics, 27-30 June 1994, New Mexico State University, Las Cruces, New Mexico, USA, Proceedings , 1994, ACL.

[22]  Martha Palmer,et al.  Verb Semantics and Lexical Selection , 1994, ACL.