The Stepwise Dimensionality Increasing (SDI) Index for High-Dimensional Data

Similarity search is a powerful paradigm for image and multimedia databases, time series databases, and DNA and protein sequence databases. Objects are represented by high-dimensional feature vectors; in the case of images, for example, these are based on color, texture, and shape. Object similarity is usually implemented via k-nearest neighbor (k-NN) queries, determined by the distance between the endpoints of the feature vectors. The cost of processing k-NN queries via a sequential scan increases with the number of objects and the number of dimensions. Multi-dimensional indexing structures can be used to improve the efficiency of k-NN query processing, but they lose their effectiveness as the dimensionality increases. The curse of dimensionality manifests itself in the form of increased overlap among the nodes of the index, so that a high fraction of index pages are touched in processing k-NN queries. Increased dimensionality also results in a reduced fanout and an increased index height. We propose a stepwise dimensionality increasing (SDI)-tree index, which aims at reducing the number of disk accesses and the CPU processing cost. The index is built using feature vectors transformed via principal component analysis. Dimensions are retained in non-increasing order of their variance according to a parameter p, which specifies the incremental fraction of variance at each level of the index. The optimal value for p is determined experimentally. Experiments on three datasets show that SDI-trees access fewer disk pages and incur less CPU time than SR-trees, VAMSR-trees, vector approximation (VA)-files, and the recently proposed iDistance method. In terms of CPU time, SDI-trees outperform the sequential scan and OMNI methods.
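The role of the parameter p can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that level l of the index keeps the smallest number of leading principal components whose cumulative variance fraction reaches min(1, l·p), which is only one plausible reading of "incremental fraction of variance". The function names `pca_transform` and `dims_per_level` are hypothetical and introduced here for illustration only.

```python
import numpy as np

def pca_transform(X):
    """Center X and rotate it onto its principal axes.

    Returns the transformed data and the per-axis variances,
    both ordered by non-increasing variance.
    """
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)            # sample covariance (ddof=1)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # reorder to descending variance
    eigvals = np.clip(eigvals[order], 0.0, None)
    return Xc @ eigvecs[:, order], eigvals

def dims_per_level(variances, p):
    """Number of leading dimensions kept at each level (ASSUMED rule).

    Level l (1-based) keeps the fewest leading dimensions whose
    cumulative variance fraction is at least min(1, l * p).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    cum = np.cumsum(variances) / np.sum(variances)
    n_levels = int(np.ceil(1.0 / p - 1e-9))
    dims = []
    for level in range(1, n_levels + 1):
        target = min(1.0, level * p)
        # first index whose cumulative fraction reaches the target
        d = int(np.searchsorted(cum, target - 1e-12)) + 1
        dims.append(min(d, len(variances)))
    return dims

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # synthetic 8-dimensional data with unequal per-axis spread
    X = rng.normal(size=(500, 8)) * np.array([5, 4, 3, 2, 1, 1, 0.5, 0.5])
    Y, variances = pca_transform(X)
    print(dims_per_level(variances, p=0.25))
```

Under this assumed rule, a smaller p yields more levels, each adding fewer dimensions, while a larger p yields a shallower structure with more dimensions per level; the abstract states that the best value is found experimentally.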
