ANN: library for approximate nearest neighbor searching

ANN is a library of C++ objects and procedures that supports approximate nearest neighbor searching. In nearest neighbor searching, we are given a set of data points $S$ in real $d$-dimensional space, $\mathbb{R}^d$, and are to build a data structure such that, given any query point $q \in \mathbb{R}^d$, the nearest data point to $q$ can be found efficiently. In general, we are given $k \ge 1$, and are asked to return the $k$ nearest neighbors to $q$ in $S$. In approximate nearest neighbor searching, an error bound $\epsilon \ge 0$ is also given. The search algorithm returns $k$ distinct points of $S$, such that the ratio between the distance to the $i$th point reported and the distance to the true $i$th nearest neighbor is at most $1 + \epsilon$.

Among the features of ANN are the following. It supports $k$-nearest neighbor searching, by specifying $k$ with the query. It supports both exact and approximate nearest neighbor searching, by specifying an approximation factor $\epsilon \ge 0$ with the query. It supports all Minkowski distance metrics, including the $L_1$ (Manhattan), $L_2$ (Euclidean), and $L_\infty$ (Max) metrics. There are no exponential factors in space, implying that the data structure is practical even for very large data sets in high dimensional spaces, irrespective of $\epsilon$.

ANN is written as a testbed for a class of nearest neighbor searching algorithms, particularly those based on orthogonal decompositions of space. These include k-d trees [3, 4], balanced box-decomposition trees [2], and other related spatial data structures (see Samet [5]). The library supports a number of different methods for building search structures. It also supports two methods for searching these structures: standard tree-ordered search [1] and priority search [2]. In priority search, the cells of the data structure are visited in increasing order of distance from the query point.

In addition to the library there are two programs provided for testing and evaluating the performance of various search methods.
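To make the approximation guarantee and the choice of Minkowski metric concrete, the following is a minimal brute-force sketch in C++. It does not use ANN or its API; the names `Point`, `minkowski`, `exactKnnDistances`, `isValidApproxAnswer`, and `kMaxMetric` are hypothetical and introduced only for illustration. The sketch assumes the reported points are given as indices into the data set, ordered from nearest to farthest, and checks that the $i$th reported distance is at most $(1 + \epsilon)$ times the true $i$th nearest distance.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical point type for this sketch: one coordinate per dimension.
using Point = std::vector<double>;

// Pass this as p to select the L_infinity (max) metric.
const double kMaxMetric = std::numeric_limits<double>::infinity();

// Minkowski L_p distance for p >= 1; p = infinity yields the max metric.
double minkowski(const Point& a, const Point& b, double p) {
    assert(a.size() == b.size() && p >= 1.0);
    double acc = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = std::fabs(a[i] - b[i]);
        if (std::isinf(p)) {
            acc = std::max(acc, diff);   // L_infinity: largest coordinate gap
        } else {
            acc += std::pow(diff, p);    // L_p: sum of p-th powers
        }
    }
    return std::isinf(p) ? acc : std::pow(acc, 1.0 / p);
}

// Brute force: distances from q to its k nearest data points, ascending.
std::vector<double> exactKnnDistances(const std::vector<Point>& data,
                                      const Point& q, std::size_t k, double p) {
    std::vector<double> dists;
    dists.reserve(data.size());
    for (const Point& s : data) {
        dists.push_back(minkowski(s, q, p));
    }
    k = std::min(k, dists.size());
    std::partial_sort(dists.begin(),
                      dists.begin() + static_cast<std::ptrdiff_t>(k),
                      dists.end());
    dists.resize(k);
    return dists;
}

// Returns true if `reported` (indices into data, nearest first) is an
// acceptable answer to a k-nearest-neighbor query with error bound eps,
// where k = reported.size().
bool isValidApproxAnswer(const std::vector<Point>& data, const Point& q,
                         const std::vector<std::size_t>& reported,
                         double eps, double p) {
    // Every index must be in range.
    for (std::size_t idx : reported) {
        if (idx >= data.size()) return false;
    }
    // The reported points must be distinct.
    std::vector<std::size_t> sorted(reported);
    std::sort(sorted.begin(), sorted.end());
    if (std::adjacent_find(sorted.begin(), sorted.end()) != sorted.end()) {
        return false;
    }
    // Compare the i-th reported distance with the true i-th nearest distance.
    // The test d_i <= (1 + eps) * d_i* is the ratio condition rewritten to
    // avoid dividing by zero when the query coincides with a data point.
    const std::vector<double> exact =
        exactKnnDistances(data, q, reported.size(), p);
    for (std::size_t i = 0; i < reported.size(); ++i) {
        const double d = minkowski(data[reported[i]], q, p);
        if (d > (1.0 + eps) * exact[i]) return false;
    }
    return true;
}
```

As an illustration, take the data points (0, 0), (1, 0), and (0, 1.05) with the query (0, 0) under the $L_2$ metric. Reporting the first and third points as the two nearest neighbors is acceptable for $\epsilon = 0.1$, since $1.05 \le 1.1 \cdot 1$, but not for $\epsilon = 0$, which demands the exact answer.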
The first program, called ann_test, provides a primitive script language that allows the user to generate data sets and query sets, either by reading from a file or randomly through the use of a number of built-in point distributions. Any of a