Improving the Query Performance of High-Dimensional Index Structures by Bulk-Load Operations

In this paper, we propose a new bulk-loading technique for high-dimensional indexes which represent an important component of multimedia database systems. Since it is very inefficient to construct an index for a large amount of data by dynamic insertion of single objects, there is an increasing interest in bulk-loading techniques. In contrast to previous approaches, our technique exploits a priori knowledge of the complete data set to improve both construction time and query performance. Our algorithm operates in a mannar similar to the Quicksort algorithm and has an average runtime complexity of O(n log n). We additionally improve the query performance by optimizing the shape of the bounding boxes, by completely avoiding overlap, and by clustering the pages on disk. As we analytically show, the split strategy typically used in dynamic index structures, splitting the data space at the 50%-quantile, results in a bad query performance in high-dimensional spaces. Therefore, we use a sophisticated unbalanced split strategy, which leads to a much better space partitioning. An exhaustive experimental evaluation shows that our technique clearly outperforms both classic index construction and competitive bulk loading techniques. In comparison with dynamic index construction we achieve a speed-up factor of up to 588 for the construction time. The constructed index causes up to 16.88 times fewer page accesses and is up to 198 times faster (real time) in query processing.

[1]  Christian Böhm,et al.  Fast parallel similarity search in multimedia databases , 1997, SIGMOD '97.

[2]  Hans-Peter Kriegel,et al.  The R*-tree: an efficient and robust access method for points and rectangles , 1990, SIGMOD '90.

[3]  Hans-Peter Kriegel,et al.  The X-tree : An Index Structure for High-Dimensional Data , 2001, VLDB.

[4]  Bernhard Seeger,et al.  A Generic Approach to Bulk Loading Multidimensional Index Structures , 1997, VLDB.

[5]  Jon Louis Bentley,et al.  An Algorithm for Finding Best Matches in Logarithmic Expected Time , 1977, TOMS.

[6]  Hans-Peter Kriegel,et al.  S3: similarity search in CAD database systems , 1997, SIGMOD '97.

[7]  Hans-Peter Kriegel,et al.  Efficient User-Adaptable Similarity Search in Large Multimedia Databases , 1997, VLDB.

[8]  Christian Böhm,et al.  A cost model for nearest neighbor search in high-dimensional data space , 1997, PODS.

[9]  John P. Oakley,et al.  Storage and Retrieval for Image and Video Databases , 1993 .

[10]  Ramesh C. Jain,et al.  Similarity indexing with the SS-tree , 1996, Proceedings of the Twelfth International Conference on Data Engineering.

[11]  H. V. Jagadish,et al.  A retrieval technique for similar shapes , 1991, SIGMOD '91.

[12]  K. Wakimoto,et al.  Efficient and Effective Querying by Image Content , 1994 .

[13]  Shin'ichi Satoh,et al.  The SR-tree: an index structure for high-dimensional nearest neighbor queries , 1997, SIGMOD '97.

[14]  Christos Faloutsos,et al.  Hilbert R-tree: An Improved R-tree using Fractals , 1994, VLDB.

[15]  Ramesh C. Jain,et al.  Similarity indexing: algorithms and performance , 1996, Electronic Imaging.

[16]  C. Q. Lee,et al.  The Computer Journal , 1958, Nature.

[17]  Rajiv Mehrotra,et al.  Feature-based retrieval of similar shapes , 1993, Proceedings of IEEE 9th International Conference on Data Engineering.

[18]  J. T. Robinson,et al.  The K-D-B-tree: a search structure for large multidimensional dynamic indexes , 1981, SIGMOD '81.