Cluster Analysis for Optimal Indexing

High-dimensional indexing is an important area of current research, especially for range and k-nearest-neighbor (kNN) queries. This work introduces clustering for the sake of indexing: the goal is to develop new clustering methods that optimize the data partitioning for an index-specific tree structure rather than finding clusters that reflect the data distribution. We focus on iDistance, a state-of-the-art high-dimensional indexing method, and take a basic approach to this new problem. Using spherical clusters in an unsupervised Expectation-Maximization algorithm that depends on local density and cluster overlap, we create a partitioning of the space that provides balanced segmentation for a B+-tree. We also examine the novel idea of reclustering for a specific indexing method, taking the output of one clustering method and reclustering it for use in an index. The algorithms are then tested and evaluated using our error metric and iDistance query performance.
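To make the notion of "balanced segmentation for a B+-tree" concrete, the following minimal sketch shows the one-dimensional key mapping generally associated with iDistance: each point is assigned to its nearest reference point (here, a cluster center) and receives the key `i * c + dist(point, ref_i)`. This is an illustration only, not the clustering algorithm proposed in this work; the function names, the sample data, and the choice of `c` are assumptions, and `c` is assumed to exceed the largest distance within any partition so that key ranges do not overlap.

```python
import math

def idistance_key(point, refs, c):
    """Return (partition index, 1-D key) for a point.

    The point is assigned to its nearest reference point, and the key is
    i * c + distance to that reference point. `c` is an assumed spacing
    constant larger than any within-partition distance.
    """
    dists = [math.dist(point, r) for r in refs]
    i = min(range(len(refs)), key=dists.__getitem__)
    return i, i * c + dists[i]

def partition_sizes(points, refs, c):
    """Count how many points fall into each partition (a crude balance check)."""
    sizes = [0] * len(refs)
    for p in points:
        i, _ = idistance_key(p, refs, c)
        sizes[i] += 1
    return sizes

# Hypothetical data: two reference points and four 2-D points.
refs = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 0.0), (0.0, 2.0), (10.0, 7.0), (9.0, 10.0)]
c = 100.0

# Sorted keys are what a B+-tree would store in its leaves.
keys = sorted(idistance_key(p, refs, c)[1] for p in points)
sizes = partition_sizes(points, refs, c)
print(keys)
print(sizes)
```

In this sketch, evenly sized partitions give each reference point a similar share of the key space, which is one simple way to read the balance objective described above; the error metric used in the paper itself is not reproduced here.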
