Finding k-Closest-Pairs Efficiently for High Dimensional Data

We present a novel approach to report approximate as well as exact k-closest pairs for sets of high dimensional points, under the $L_t$-metric, $t = 1, \ldots, \infty$. The proposed algorithms are efficient and simple to implement. They all use multiple shifted copies of the data points, sorted according to their position along a space-filling curve such as the Peano curve, in a way that allows us to make performance guarantees without assuming that the dimensionality $d$ is constant. The first algorithm computes an $O(d^{1+1/t})$ approximation to the $k$-th closest pair distance in $O(d^2 n \log n + dk(d + \log k))$ time. Experimental results, obtained using various real data sets of varying dimensions, indicate that the approximation factor is much better in practice. The second algorithm uses this approximation to find the exact $k$ closest pairs in $O(dM)$ additional time, where $M$ is the number of points in certain short subsegments of the space-filling curve. The exact algorithm is particularly efficient, and $M = O(k)$ can be guaranteed, when the data sets satisfy certain separation conditions. The proposed approach can be easily adapted to other proximity problems, including fixed-radius neighbor search, minimal k-point clustering, and nearest neighbor search.
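
The general idea of sorting shifted copies of the points along a space-filling curve and comparing nearby points in each order can be illustrated with a minimal sketch. This is not the authors' algorithm and it carries none of the stated guarantees. It assumes non-negative integer coordinates below 2^bits, and it swaps in a Z-order (Morton) curve for the Peano curve because the key is simpler to compute. The function names, the shift schedule, and the `window` parameter are all hypothetical choices made for illustration.

```python
import heapq


def morton_key(point, bits):
    """Z-order key: interleave the low `bits` bits of each integer coordinate."""
    key = 0
    for b in range(bits - 1, -1, -1):
        for c in point:
            key = (key << 1) | ((c >> b) & 1)
    return key


def lt_distance(p, q, t=2.0):
    """L_t distance between two points; t = float('inf') gives the max norm."""
    if t == float("inf"):
        return max(abs(a - b) for a, b in zip(p, q))
    return sum(abs(a - b) ** t for a, b in zip(p, q)) ** (1.0 / t)


def approx_k_closest_pairs(points, k, t=2.0, window=2, bits=10):
    """Heuristic k closest pairs from d + 1 shifted Z-order sorts.

    Assumption: coordinates are integers in [0, 2**bits).
    Returns up to k tuples (distance, i, j) with i < j, smallest first.
    """
    n, d = len(points), len(points[0])
    # Hypothetical shift schedule: d + 1 evenly spaced diagonal shifts.
    step = (1 << bits) // (d + 1)
    candidates = set()
    for s in (j * step for j in range(d + 1)):
        # Shifted coordinates stay below 2**(bits + 1), so use one extra bit.
        order = sorted(
            range(n),
            key=lambda i: morton_key([c + s for c in points[i]], bits + 1),
        )
        # Only points within `window` positions along the curve are compared.
        for pos in range(n):
            for nxt in range(pos + 1, min(pos + 1 + window, n)):
                i, j = order[pos], order[nxt]
                candidates.add((min(i, j), max(i, j)))
    scored = [(lt_distance(points[i], points[j], t), i, j) for i, j in candidates]
    return heapq.nsmallest(k, scored)


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (10, 10), (50, 50), (51, 50), (100, 3)]
    for dist, i, j in approx_k_closest_pairs(pts, k=2):
        print(i, j, dist)
```

Because every candidate is a real pair of input points, the k-th distance returned here is an upper bound on the true k-th closest pair distance whenever at least k candidates are found. How tight that bound is depends on the shifts and the window size chosen in this sketch.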
