A partitioning method for high dimensional data

Nearest neighbor search in high-dimensional space is an important operation in many applications, such as data mining and multimedia databases. Evaluating the similarity of a point to all other points in a high-dimensional space incurs a high computational cost. To reduce this cost, index structures are frequently used. Most of these index structures are built by partitioning the data set according to a specific criterion. However, because the partitions are disjoint, partitioning approaches can fail to find the true nearest neighbor, for instance when it lies in a different partition from the query point. In this paper, we propose an Error Minimizing Partitioning (E-MP) method with a novel tree structure, which minimizes such failures in finding the nearest neighbors. E-MP divides the data into subsets while taking the distribution of the data set into account. To partition a data set, the proposed method first finds the first principal component of the data set using principal component analysis (PCA). It then finds the centroid of the data set. Finally, it chooses the partitioning hyperplane that passes through the centroid and is perpendicular to the first principal component vector. We also compare the proposed method with existing methods to verify its usability.
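The single partitioning step described in the abstract (first principal component, centroid, perpendicular hyperplane) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes NumPy, obtains the principal component from an SVD of the centered data, and uses a hypothetical function name `pca_split`. Assigning points that lie exactly on the hyperplane to the first subset is also an assumption. The E-MP tree structure and its search procedure are not shown.

```python
import numpy as np


def pca_split(points):
    """Split points with the hyperplane through the centroid whose
    normal is the first principal component (illustrative sketch)."""
    points = np.asarray(points, dtype=float)

    # Centroid of the data set.
    centroid = points.mean(axis=0)
    centered = points - centroid

    # Rows of vt are the principal directions of the centered data,
    # ordered by decreasing singular value; vt[0] is the first one.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]

    # Signed distance of each point from the hyperplane.
    side = centered @ direction

    # Assumption: points exactly on the hyperplane go to the first subset.
    first = points[side <= 0]
    second = points[side > 0]
    return centroid, direction, first, second
```

Applying the same step recursively to each subset would yield a binary tree of partitions. Note that the sign of the principal component returned by an SVD is arbitrary, so which subset counts as "first" may flip between runs or platforms.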
