Finding near neighbors through cluster pruning

Finding near(est) neighbors is a classic, difficult problem in data management and retrieval, with applications in text and image search,in finding similar objects and matching patterns. Here we study cluster pruning, an extremely simple randomized technique. During preprocessing we randomly choose a subset of data points to be leaders the remaining data points are partitioned by which leader is the closest. For query processing, we find the leader(s) closest to the query point. We then seek the nearest neighbors for the query point among only the points in the clusters of the closest leader(s). Recursion may be used in both preprocessing and in search. Such schemes seek approximate nearest neighbors that are "almost as good" as the nearest neighbors. How good are these approximations and how much do they save in computation. Our contributions are: (1) we quantify metrics that allow us to study the tradeoff between processing and the quality of the approximate nearest neighbors; (2) we give rigorous theoretical analysis of our schemes, under natural generative processes (generalizing Gaussian mixtures) for the data points; (3) experiments on both synthetic data from such generative processes, as well as on from a document corpus, confirming that we save orders of magnitude in query processing cost at modest compromises in the quality of retrieved points. In particular, we show that p-spheres, a state-of-the-art solution, is outperformed by our simple scheme whether the data points are stored in main or in external memo.

[1]  Jonathan Goldstein,et al.  Contrast Plots and P-Sphere Trees: Space vs. Time in Nearest Neighbour Searches , 2000, VLDB.

[2]  A. Nanopoulos,et al.  R-trees Have Grown Everywhere , 2003 .

[3]  Kenneth L. Clarkson,et al.  Nearest Neighbor Queries in Metric Spaces , 1997, STOC '97.

[4]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[5]  Marshall W. Bern,et al.  Approximate Closest-Point Queries in High Dimensions , 1993, Inf. Process. Lett..

[6]  Herbert Edelsbrunner,et al.  Algorithms in Combinatorial Geometry , 1987, EATCS Monographs in Theoretical Computer Science.

[7]  Z. Meral Özsoyoglu,et al.  Distance-based indexing for high-dimensional metric spaces , 1997, SIGMOD '97.

[8]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[9]  Douglas Comer,et al.  Ubiquitous B-Tree , 1979, CSUR.

[10]  Ronald Fagin,et al.  Efficient similarity search and classification via rank aggregation , 2003, SIGMOD '03.

[11]  Hans-Peter Kriegel,et al.  The X-tree : An Index Structure for High-Dimensional Data , 2001, VLDB.

[12]  Kristin P. Bennett,et al.  Density-based indexing for approximate nearest-neighbor queries , 1999, KDD '99.

[13]  Sergey Brin,et al.  Near Neighbor Search in Large Metric Spaces , 1995, VLDB.

[14]  Shin'ichi Satoh,et al.  The SR-tree: an index structure for high-dimensional nearest neighbor queries , 1997, SIGMOD '97.

[15]  Chris Buckley,et al.  Optimization of inverted vector searches , 1985, SIGIR '85.

[16]  Rafail Ostrovsky,et al.  Efficient search for approximate nearest neighbor in high dimensional spaces , 1998, STOC '98.

[17]  Peter Willett,et al.  Recent trends in hierarchic document clustering: A critical review , 1988, Inf. Process. Manag..

[18]  Jonathan Goldstein,et al.  When Is ''Nearest Neighbor'' Meaningful? , 1999, ICDT.

[19]  Sunil Arya,et al.  An optimal algorithm for approximate nearest neighbor searching fixed dimensions , 1998, JACM.

[20]  Vladimir Pestov,et al.  On the geometry of similarity search: Dimensionality curse and concentration of measure , 1999, Inf. Process. Lett..

[21]  George Karypis,et al.  A Comparison of Document Clustering Techniques , 2000 .

[22]  O. Antoine,et al.  Theory of Error-correcting Codes , 2022 .

[23]  Vipin Kumar,et al.  Finding Topics in Collections of Documents: A Shared Nearest Neighbor Approach , 2003, Clustering and Information Retrieval.

[24]  Vipin Kumar,et al.  Chameleon: Hierarchical Clustering Using Dynamic Modeling , 1999, Computer.

[25]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.