Paper information - A template for the nearest neighbor problem

A template for the nearest neighbor problem

The nearest neighbor problem (NNP) is central to many practical applications. Also known as the “Post Office Problem,” it asks for the point, among a set of known positions, that is closest to an arbitrarily chosen probe position. Obvious uses for such an algorithm include map-lookup problems, cluster analysis, finite-element problems, string lookup, and computing intermolecular contacts between two protein molecules. It is usually assumed that many lookups will be done against a static dataset, so the cost of building the lookup structure matters less than the cost of the lookup itself. The literature offers many methods for solving the NNP, including kd-trees, quadtrees, and scan-line-based methods. In 1983, MacDonald and Kalantari [1] published an overlooked algorithm based on what would now be called partition trees. Their algorithm may have been overlooked because it used a complicated doubly linked tree, but simple recursion accomplishes the same result. It is optimal (in most cases) and fast, and it has the advantage that storage is nearly linear in the number of objects in the database. Time to build the tree is proportional to n log n for n points in the dataset, and the time to retrieve the nearest neighbor to a probe point is proportional to (log n)(log m), where m is the dimensionality of the data. Over the years, I have used this algorithm to solve many problems in my own work. Rewritten in five computer languages, it has allowed me to search large datasets, typically in 6, 7, 10, or up to 36 dimensions. Computing the intermolecular contacts between protein molecules in a crystal was simple to implement and extremely fast to execute. Just for fun, I wrote a program to draw a line from a screen cursor to the nearest point among 20,000 random points on the screen; it worked in real time. A simple modification allows the method to return all points within a specified radius of the probe point.
Joining the set of near points in real time to the cursor on the screen makes it look like a spider is crawling on the monitor. In translating the algorithm into C++, I needed to handle cases in several dimensions, sometimes within a single program, so a generic solution was important. Since the algorithm itself is independent of dimension, the creation of a template class seemed the obvious solution. That solution is presented here.
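
To make the idea of a dimension-independent template concrete, here is a minimal sketch of a recursive partition tree. It is not the author's published class. The names NearTreeSketch, insert, nearest, within, and euclidean are hypothetical, and the layout (two pivots per node, each remembering how far its subtree reaches) is an assumption made for illustration. The only requirement on the point type T is a distance function that obeys the triangle inequality, which is what lets the search skip whole subtrees.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <memory>
#include <vector>

// Illustrative sketch only: all names here are hypothetical, not the article's API.
// T is any copyable point type; Dist is a callable (const T&, const T&) -> double
// that must be a true metric (the pruning relies on the triangle inequality).
template <typename T, typename Dist>
class NearTreeSketch {
public:
    explicit NearTreeSketch(Dist dist) : dist_(dist) {}

    void insert(const T& p) { insert(root_, p); }

    // Returns false if the tree is empty; otherwise writes the closest point.
    bool nearest(const T& probe, T& closest) const {
        double best = std::numeric_limits<double>::infinity();
        return search(root_.get(), probe, best, closest);
    }

    // Returns every stored point whose distance to the probe is <= radius.
    std::vector<T> within(const T& probe, double radius) const {
        std::vector<T> out;
        collect(root_.get(), probe, radius, out);
        return out;
    }

private:
    struct Node {
        std::vector<T> pts;              // at most two pivot points
        double reach[2] = {0.0, 0.0};    // max distance from pivot i to points below it
        std::unique_ptr<Node> child[2];  // subtree of points assigned to pivot i
    };

    void insert(std::unique_ptr<Node>& node, const T& p) {
        if (!node) node = std::make_unique<Node>();
        if (node->pts.size() < 2) { node->pts.push_back(p); return; }
        const double d0 = dist_(p, node->pts[0]);
        const double d1 = dist_(p, node->pts[1]);
        const int side = (d0 <= d1) ? 0 : 1;        // descend toward the closer pivot
        const double d = (side == 0) ? d0 : d1;
        if (d > node->reach[side]) node->reach[side] = d;
        insert(node->child[side], p);
    }

    bool search(const Node* node, const T& probe, double& best, T& closest) const {
        if (!node) return false;
        bool found = false;
        for (std::size_t i = 0; i < node->pts.size(); ++i) {
            const double d = dist_(probe, node->pts[i]);
            if (d < best) { best = d; closest = node->pts[i]; found = true; }
            // Anything below pivot i lies at least d - reach[i] from the probe,
            // so the subtree can be skipped when that bound cannot beat `best`.
            if (node->child[i] && d - node->reach[i] < best)
                found = search(node->child[i].get(), probe, best, closest) || found;
        }
        return found;
    }

    void collect(const Node* node, const T& probe, double radius,
                 std::vector<T>& out) const {
        if (!node) return;
        for (std::size_t i = 0; i < node->pts.size(); ++i) {
            const double d = dist_(probe, node->pts[i]);
            if (d <= radius) out.push_back(node->pts[i]);
            if (node->child[i] && d - node->reach[i] <= radius)
                collect(node->child[i].get(), probe, radius, out);
        }
    }

    Dist dist_;
    std::unique_ptr<Node> root_;
};

// Hypothetical helper: Euclidean distance for fixed-size coordinate arrays.
template <std::size_t N>
double euclidean(const std::array<double, N>& a, const std::array<double, N>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < N; ++i) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(sum);
}
```

Because the tree only ever calls the distance function, the same class would serve 2-D screen points, 36-dimensional vectors, or strings under an edit distance. In this sketch, the radius query differs from the nearest-point query only in keeping the search radius fixed instead of shrinking it. Note that this version inserts points one at a time and does not rebalance, so the build and lookup costs quoted above should not be assumed to hold for it on unfavorably ordered input.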

Larry Andrews | L. Andrews

[1] Iraj Kalantari, et al. A Data Structure and an Algorithm for the Nearest Point Problem, 1983, IEEE Transactions on Software Engineering.