An Approximation-Based Data Structure for Similarity Search

Many similarity measures for multimedia retrieval, decision support, and data mining are based on underlying vector spaces of high dimensionality. Data-partitioning index methods for such spaces (e.g., grid-files, quad-trees, R-trees, X-trees) generally work well for low-dimensional spaces, but perform poorly as dimensionality increases, a phenomenon which has become known as the 'dimensional curse'. In this paper, we first provide an analysis of the nearest-neighbor search problem in high-dimensional vector spaces. Under the assumptions of uniformity and independence, we establish bounds on the average performance of three important classes of data-partitioning techniques. We then introduce the vector-approximation file (VA-File), a method which overcomes the difficulties of high dimensionality by following not the data-partitioning approach of conventional index methods, but rather a filter-based approach. A VA-File contains a compact, geometric approximation for each vector. By first scanning these smaller approximations, only a small fraction of the vectors themselves must be visited. Thus, the VA-File acts as a simple filter, much as a signature file is a filter. Performance is evaluated on the basis of both synthetic and real data sets, and compared to that of the R*-tree and the X-tree. We show that performance does not degrade, and even improves with increased dimensionality. Both our analytical and our experimental results suggest that the VA-File is generally the preferred method for similarity search over moderate and large data sets with dimensionality in excess of around ten.
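The filter idea described above can be illustrated with a minimal sketch. The abstract does not specify how the approximations are built or searched, so everything below is an assumption made for illustration: coordinates are taken to lie in [0, 1), each dimension is quantized on a uniform grid with `bits` bits, and the search runs in two phases (scan all approximations to compute distance bounds, then visit only the surviving vectors). The names `quantize`, `bounds`, and `nearest` are hypothetical and are not taken from the paper.

```python
import math
import random


def quantize(vec, bits):
    """Map each coordinate in [0, 1) to a grid cell index (assumed uniform grid)."""
    cells = 1 << bits
    return tuple(min(int(x * cells), cells - 1) for x in vec)


def bounds(query, approx, bits):
    """Lower and upper bounds on the Euclidean distance from `query`
    to any point inside the grid cell described by `approx`."""
    cells = 1 << bits
    lo_sq = hi_sq = 0.0
    for q, c in zip(query, approx):
        lo, hi = c / cells, (c + 1) / cells  # cell edges in this dimension
        if q < lo:
            near = lo - q
        elif q > hi:
            near = q - hi
        else:
            near = 0.0  # query coordinate falls inside the cell
        far = max(abs(q - lo), abs(q - hi))
        lo_sq += near * near
        hi_sq += far * far
    return math.sqrt(lo_sq), math.sqrt(hi_sq)


def nearest(query, vectors, approximations, bits):
    """Return (index, distance, vectors_visited) for the nearest neighbor."""
    # Phase 1: scan only the compact approximations.
    scored = [bounds(query, a, bits) for a in approximations]
    best_upper = min(upper for _, upper in scored)
    # A vector whose lower bound exceeds the smallest upper bound cannot win.
    candidates = sorted(
        (lower, i) for i, (lower, _) in enumerate(scored) if lower <= best_upper
    )
    # Phase 2: visit the surviving vectors in order of increasing lower bound.
    best_i, best_d, visited = -1, math.inf, 0
    for lower, i in candidates:
        if lower > best_d:
            break  # no remaining candidate can be closer
        visited += 1
        d = math.dist(query, vectors[i])
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d, visited


# Usage on synthetic uniform data (fixed seed so the run is repeatable).
rng = random.Random(0)
dim, bits = 8, 4
vectors = [[rng.random() for _ in range(dim)] for _ in range(1000)]
approximations = [quantize(v, bits) for v in vectors]
query = [rng.random() for _ in range(dim)]
index, distance, visited = nearest(query, vectors, approximations, bits)
print(index, distance, visited)
```

In this sketch the approximations take `dim * bits` bits per vector, and `visited` counts how many full vectors had to be read after filtering. The result is exact because the true nearest neighbor always has a lower bound no larger than the smallest upper bound.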
