Indexing Schemes for Similarity Search: an Illustrated Paradigm

We suggest a variation of the Hellerstein–Koutsoupias–Papadimitriou indexability model for datasets equipped with a similarity measure, with the aim of better understanding the structure of indexing schemes for similarity-based search and the geometry of similarity workloads. In particular, this provides a unified approach to a great variety of schemes used to index into metric spaces and facilitates their transfer to more general similarity measures such as quasi-metrics. We discuss links between the performance of indexing schemes and high-dimensional geometry. The concepts and results are illustrated on a very large concrete dataset of peptide fragments equipped with a biologically significant similarity measure.
