A hybrid knowledge-based and data-driven approach to identifying semantically similar concepts

An open research question in leveraging ontological knowledge is when to treat concepts separately and when to aggregate them. For instance, the concepts for the terms "paroxysmal cough" and "nocturnal cough" might be aggregated in a kidney disease study but should be kept separate in a pneumonia study. Determining whether two concepts are similar enough to be aggregated can help build better datasets for data mining and avoid signal dilution. Quantifying the similarity between concepts is difficult, however, in part because such similarity is context-dependent. We propose a method that computes a similarity score for a concept pair by combining data-driven and ontology-driven knowledge. We demonstrate the method on concepts from SNOMED-CT and on a corpus of clinical notes of patients with chronic kidney disease. By combining information from usage patterns in clinical notes with information from the ontological structure, the method can separate concepts that are merely related from those that are semantically similar. When evaluated against a list of concept pairs annotated for similarity, our method reaches an area under the curve (AUC) of 92%.
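
The general idea of combining an ontology-driven score with a data-driven score, and then evaluating the result with AUC, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the method of the paper: the toy is-a hierarchy stands in for SNOMED-CT, the context vectors are hypothetical co-occurrence counts rather than statistics from clinical notes, the ontology score is a simple path-based measure, the data-driven score is a cosine similarity, the two are merged with a hypothetical weighted average (`weight`), and AUC is taken to be the area under the ROC curve.

```python
import math

# Hypothetical toy is-a hierarchy (child -> parent); not actual SNOMED-CT content.
PARENT = {
    "paroxysmal cough": "cough",
    "nocturnal cough": "cough",
    "cough": "respiratory finding",
    "wheezing": "respiratory finding",
    "respiratory finding": "clinical finding",
    "edema": "clinical finding",
}


def ancestors(concept):
    """Return the path from a concept up to the root, concept first."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def path_similarity(a, b):
    """Ontology-driven score: 1 / (1 + is-a edges on the shortest path)."""
    path_a, path_b = ancestors(a), ancestors(b)
    depth_in_b = {c: i for i, c in enumerate(path_b)}
    for i, c in enumerate(path_a):
        if c in depth_in_b:  # first shared ancestor is the lowest one
            return 1.0 / (1 + i + depth_in_b[c])
    return 0.0  # no shared ancestor


def cosine(u, v):
    """Data-driven score: cosine similarity of sparse context vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def combined_similarity(a, b, contexts, weight=0.5):
    """Assumed combination: weighted average of the two scores."""
    return (weight * path_similarity(a, b)
            + (1 - weight) * cosine(contexts[a], contexts[b]))


def auc(scores, labels):
    """Chance that a similar pair outscores a non-similar pair (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In this sketch, `combined_similarity` would be called once per annotated concept pair, and the resulting scores passed to `auc` together with the binary similarity annotations. A weighted average is only one possible way to merge the two signals; context-dependence could be reflected by computing the context vectors from a study-specific corpus.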
