Vector approximation based indexing for non-uniform high dimensional data sets

With the proliferation of multimedia data, there is an increasing need to support the indexing and searching of high-dimensional data. Recently, a vector approximation based technique called the VA-file has been proposed for indexing high-dimensional data. It has been shown that the VA-file is an effective technique compared to the current approaches based on space and data partitioning. The VA-file performs especially well when the data set is uniformly distributed. Real data sets, however, are not uniformly distributed: they are often clustered, and the dimensions of their feature vectors are usually correlated. More careful analysis of nonuniform or correlated data is therefore needed to index high-dimensional data effectively. We address these problems by proposing the VA+-file, a new technique for indexing high-dimensional data sets based on vector approximations. We conclude with an evaluation of nearest neighbor queries and show that the VA+-file technique yields significant improvements over the current VA-file approach for several real data sets.
