Finding Top-k Similar Pairs of Objects Annotated with Terms from an Ontology

With the growing focus on semantic search, an increasing number of standardized ontologies are being designed to describe data. We investigate the querying of objects described by a tree-structured ontology. Specifically, we consider the problem of finding the top-k best pairs of objects that have been annotated with terms from such an ontology when the object descriptions are available only at runtime. We consider three distance measures. The first defines the distance between two objects as the minimum pairwise distance between the sets of terms describing them, and the second defines it as the average pairwise term distance. The third and most useful measure, the earth mover's distance, finds the best way of matching the terms and computes the distance corresponding to this best matching. We develop lower bounds that can be aggregated progressively and use them to speed up the search for the top-k object pairs when the earth mover's distance is used. For the minimum pairwise distance, we devise an algorithm that runs in O(D + Tk log k) time, where D is the total information size and T is the number of terms in the ontology. We also develop a best-first search strategy for the average pairwise distance that utilizes lower bounds generated in an ordered manner. Experiments on real and synthetic datasets demonstrate the practicality and scalability of our algorithms.
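To make the three distance measures concrete, the following is a minimal sketch of a naive baseline, not the algorithms developed in the paper. It assumes a toy ontology (the `PARENT` map and all term names are hypothetical), takes the term distance to be the number of edges on the tree path between two terms, and gives every term of an object equal weight when computing the earth mover's distance. The earth mover's distance is found by brute force over all matchings, which is only feasible for very small term sets, and the top-k search simply enumerates every object pair.

```python
from heapq import nsmallest
from itertools import combinations, permutations
from math import gcd

# Hypothetical toy ontology: child -> parent (the root maps to None).
PARENT = {"root": None, "a": "root", "b": "root",
          "a1": "a", "a2": "a", "b1": "b"}

def ancestors(term):
    """Return the path from term up to the root, inclusive."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path

def term_distance(s, t):
    """Edges on the tree path between s and t (assumed term metric)."""
    depth_from_s = {node: i for i, node in enumerate(ancestors(s))}
    for j, node in enumerate(ancestors(t)):
        if node in depth_from_s:  # first common ancestor
            return depth_from_s[node] + j
    raise ValueError("terms are not in the same tree")

def min_pairwise(A, B):
    """Smallest term distance over all cross pairs."""
    return min(term_distance(s, t) for s in A for t in B)

def avg_pairwise(A, B):
    """Mean term distance over all cross pairs."""
    total = sum(term_distance(s, t) for s in A for t in B)
    return total / (len(A) * len(B))

def emd_uniform(A, B):
    """Earth mover's distance with equal weight per term (assumption).

    Terms are replicated so both sides have the same number of units,
    then every one-to-one matching is tried (brute force).
    """
    A, B = sorted(A), sorted(B)
    g = gcd(len(A), len(B))
    left = [s for s in A for _ in range(len(B) // g)]
    right = [t for t in B for _ in range(len(A) // g)]
    best = min(sum(term_distance(s, t) for s, t in zip(left, perm))
               for perm in permutations(right))
    return best / len(left)

def top_k_pairs(objects, k, dist):
    """Naive top-k: score every pair and keep the k smallest."""
    pairs = combinations(sorted(objects), 2)
    scored = ((dist(objects[x], objects[y]), x, y) for x, y in pairs)
    return nsmallest(k, scored)

# Hypothetical annotated objects.
OBJECTS = {"o1": {"a1", "a2"}, "o2": {"a1", "b1"}, "o3": {"b1"}}

if __name__ == "__main__":
    for name, dist in [("min", min_pairwise), ("avg", avg_pairwise),
                       ("emd", emd_uniform)]:
        print(name, top_k_pairs(OBJECTS, 2, dist))
```

On this toy data, `o1` and `o2` share the term `a1`, so their minimum pairwise distance is 0 even though their other terms differ, while the earth mover's distance charges for the unmatched terms. The exhaustive pair enumeration is the cost that progressive lower bounds and best-first search are meant to avoid.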
