Contractive Embedding Methods for Similarity Searching in Metric Spaces

Complex data types (e.g., images, documents, DNA sequences) are becoming increasingly important in database applications. The term multimedia database is often used to characterize such databases. A typical query for such data seeks to find objects that are similar to some target object, where (dis)similarity is defined by some distance function. Often, the cost of evaluating the distance between two objects is very high. Thus, the number of distance evaluations should be kept to a minimum, while (ideally) maintaining the quality of the result. One way to approach this goal is to embed the data objects in a vector space, such that the distances between the embedded objects approximate the actual distances. Thus, queries can be performed (for the most part) on the embedded objects. In this paper, our focus is on embedding methods that allow returning the same query result as if the actual distances between the objects were consulted, thus ensuring that no relevant objects are left out (i.e., there are no false dismissals). Particular attention is paid to SparseMap, a variant of Lipschitz embeddings, and FastMap, which is designed to be a heuristic alternative to the KLT (and the equivalent PCA and SVD) method for dimensionality reduction. We show that neither SparseMap nor FastMap guarantees that queries on the embedded objects have no false dismissals. However, we describe a variant of SparseMap that allows queries with no false dismissals. Moreover, we show that with FastMap, the distances between the embedded objects can be much greater than the actual distances. This makes it impossible (or at least impractical) to modify FastMap to guarantee no false dismissals.

This work was supported in part by the National Science Foundation under Grant IRI-97-12715.
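
To make the no-false-dismissal condition concrete, the following is a minimal sketch of a contractive embedding used in a filter-and-refine range query. It is not the SparseMap variant described in the paper. It assumes the simplest Lipschitz-style embedding, in which each coordinate is the distance to a single reference object, compared under the maximum (L-infinity) norm; by the triangle inequality, such embedded distances never exceed the actual distances. The function names, the choice of edit distance as the "expensive" metric, and the sample strings are all illustrative assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance; a metric standing in for an expensive distance function."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def embed(obj, pivots, dist):
    """Map obj to the vector of its distances to each reference object (pivot)."""
    return tuple(dist(obj, p) for p in pivots)


def embedded_distance(u, v):
    """L-infinity distance. For the embedding above, the triangle inequality gives
    |d(a, p) - d(b, p)| <= d(a, b) for every pivot p, so this never exceeds d(a, b)."""
    return max(abs(x - y) for x, y in zip(u, v))


def range_query(query, radius, objects, vectors, pivots, dist):
    """Filter on cheap embedded distances, then refine with the actual distance."""
    q_vec = embed(query, pivots, dist)
    # Filter step: contractiveness means any object within `radius` of the query
    # is also within `radius` in the embedding, so no true answer is discarded.
    candidates = [o for o, v in zip(objects, vectors)
                  if embedded_distance(q_vec, v) <= radius]
    # Refine step: remove false hits using the actual (expensive) distance.
    return [o for o in candidates if dist(query, o) <= radius]


# Hypothetical sample data, chosen only for illustration.
objects = ["kitten", "sitting", "mitten", "bitten", "written", "fitting", "knitting"]
pivots = objects[:2]                      # arbitrary choice of reference objects
vectors = [embed(o, pivots, edit_distance) for o in objects]  # precomputed once

result = range_query("kitchen", 2, objects, vectors, pivots, edit_distance)
brute_force = [o for o in objects if edit_distance("kitchen", o) <= 2]
```

In this sketch, `result` matches `brute_force` because the filter step can only admit extra candidates, never exclude a true answer. If the embedded distances could exceed the actual ones, as the abstract reports for FastMap, the filter step could discard relevant objects and the guarantee would be lost.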
