Paper information - An Approximation-Based Data Structure for Similarity Search - 字舞流文

An Approximation-Based Data Structure for Similarity Search

Many similarity measures for multimedia retrieval, decision support, and data mining are based on underlying vector spaces of high dimensionality. Data-partitioning index methods for such spaces (e.g. grid-files, quad-trees, R-trees, X-trees, etc.) generally work well for low-dimensional spaces, but perform poorly as dimensionality increases, a phenomenon which has become known as the 'dimensional curse'. In this paper, we first provide an analysis of the nearest-neighbor search problem in high-dimensional vector spaces. Under the assumptions of uniformity and independence, we establish bounds on the average performance of three important classes of data-partitioning techniques. We then introduce the vector-approximation file (VA-File), a method which overcomes the difficulties of high dimensionality by following not the data-partitioning approach of conventional index methods, but rather a filter-based approach. A VA-File contains a compact, geometric approximation for each vector. By first scanning these smaller approximations, only a small fraction of the vectors themselves must be visited. Thus, the VA-File acts as a simple filter, much as a signature file is a filter. Performance is evaluated on the basis of both synthetic and real data sets, and compared to that of the R*-tree and the X-tree. We show that performance does not degrade, and even improves, with increased dimensionality. Both our analytical and our experimental results suggest that the VA-File is generally the preferred method for similarity search over moderate and large data sets with dimensionality in excess of around ten.
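The filter-based idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: vectors lie in the unit hypercube, each dimension is cut into `2**bits` equal-width cells (the `bits` parameter and all function names are hypothetical), and the search visits full vectors in order of increasing lower bound until no remaining approximation can improve the result.

```python
import heapq
import math
import random


def build_va_file(vectors, bits=4):
    """Approximate each vector by its per-dimension cell index.

    Assumption: coordinates lie in [0, 1] and cells have equal width.
    """
    cells = 1 << bits
    return [[min(int(x * cells), cells - 1) for x in v] for v in vectors]


def euclidean(a, b):
    """Exact Euclidean distance between two vectors."""
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
    return math.sqrt(total)


def lower_bound(query, approx, bits=4):
    """Smallest possible distance from the query to any point in the cell."""
    cells = 1 << bits
    total = 0.0
    for q, c in zip(query, approx):
        lo, hi = c / cells, (c + 1) / cells
        if q < lo:
            total += (lo - q) ** 2
        elif q > hi:
            total += (q - hi) ** 2
    return math.sqrt(total)


def nn_search(query, vectors, approximations, k=1, bits=4):
    """Two-phase k-nearest-neighbor search over a VA-File-style filter.

    Phase 1 scans only the approximations and computes lower bounds.
    Phase 2 visits full vectors in order of increasing lower bound and
    stops once the bound reaches the current k-th best exact distance.
    Returns (sorted list of (distance, index), number of vectors visited).
    """
    bounds = sorted(
        (lower_bound(query, a, bits), i) for i, a in enumerate(approximations)
    )
    best = []  # max-heap of the k best so far, stored as (-distance, index)
    visited = 0
    for lb, i in bounds:
        if len(best) == k and lb >= -best[0][0]:
            break  # no remaining vector can be closer than the k-th best
        d = euclidean(query, vectors[i])
        visited += 1
        if len(best) < k:
            heapq.heappush(best, (-d, i))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, i))
    return sorted((-nd, i) for nd, i in best), visited


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[rng.random() for _ in range(16)] for _ in range(2000)]
    va = build_va_file(data, bits=4)
    q = [rng.random() for _ in range(16)]
    neighbors, visited = nn_search(q, data, va, k=5, bits=4)
    print(neighbors)
    print(f"visited {visited} of {len(data)} full vectors")
```

In this sketch the approximations play the role of the filter: every approximation is scanned, but a full vector is read only while its lower bound is still below the current k-th best distance. How small the visited fraction becomes depends on `bits`, the dimensionality, and the data distribution.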

Stephen Blott | Roger Weber | Michele Degli Esposti | Letizia Falcone
