Finding Similar Objects Using a Taxonomy: A Pragmatic Approach

Several authors have suggested similarity measures for objects labeled with terms from a hierarchical taxonomy. We generalize this idea with a definition of information-theoretic similarity for taxonomies that are structured as directed acyclic graphs, from which multiple terms may be used to describe an object. We discuss how our definition should be adapted in the presence of ambiguity, and introduce new similarity measures based on it. We present an implementation of our measures that is integrated with a relational database and scales to large taxonomies and datasets. We evaluate our measures by applying them to an object-matching problem from bioinformatics, and show that, for this task, our new measures outperform those reported in the literature. We also verify the scalability of our approach by applying it to patent similarity search, using patents classified with terms from the taxonomy defined by the United States Patent and Trademark Office.
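To make the setting concrete, the following is a minimal sketch of a generic information-content similarity over a taxonomy structured as a directed acyclic graph, with objects described by several terms. It is not the measure defined in this paper. The toy taxonomy, the annotations, the function names, and the best-match averaging used to combine multiple terms are all assumptions chosen for illustration.

```python
import math

# Hypothetical toy taxonomy: each term maps to its parent terms.
# Term "c" has two parents, so the structure is a DAG, not a tree.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["a"],
}

# Hypothetical annotations: each object is described by one or more terms.
ANNOTATIONS = {
    "o1": {"c"},
    "o2": {"d"},
    "o3": {"b"},
    "o4": {"c", "d"},
}


def ancestors(term):
    """Return the term itself plus every term reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen


def information_content():
    """IC(t) = log(N / n_t), where n_t counts objects annotated with t
    or any descendant of t. Terms with no annotated objects are omitted."""
    counts = {t: 0 for t in PARENTS}
    for terms in ANNOTATIONS.values():
        covered = set()
        for t in terms:
            covered |= ancestors(t)
        for t in covered:
            counts[t] += 1
    n = len(ANNOTATIONS)
    return {t: math.log(n / c) for t, c in counts.items() if c > 0}


def term_similarity(t1, t2, ic):
    """Twice the IC of the most informative common ancestor,
    normalized by the summed IC of the two terms."""
    common = ancestors(t1) & ancestors(t2)
    best = max(ic[t] for t in common)
    denom = ic[t1] + ic[t2]
    return 2.0 * best / denom if denom > 0 else 0.0


def object_similarity(o1, o2, ic):
    """Assumed combination rule: average each term's best match
    against the other object's terms, in both directions."""
    a, b = ANNOTATIONS[o1], ANNOTATIONS[o2]
    scores = [max(term_similarity(x, y, ic) for y in b) for x in a]
    scores += [max(term_similarity(y, x, ic) for x in a) for y in b]
    return sum(scores) / len(scores)


IC = information_content()
print(round(term_similarity("c", "d", IC), 3))
print(round(object_similarity("o1", "o4", IC), 3))
```

In this toy data, terms that share only the root score zero, because the root covers every object and so carries no information. A term with multiple parents, such as "c", can share an informative ancestor with terms in either branch, which is the case a tree-only measure cannot represent.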
