Making the Nearest Neighbor Meaningful

The nearest-neighbor problem arises in clustering and other applications. It requires us to define a function that measures differences among items in a data set, and then to compute the items closest to a query point with respect to this measure. Recent work suggests that the conventional Euclidean measure does not adequately model high-dimensional data. We present a new, data-driven difference measure for categorical data, in which the difference between two data points is based on the frequency of the categories or combinations of categories that they have in common. This measure addresses the main flaw of the Euclidean distance measure, namely that it treats each dimension independently. We then provide both brute-force algorithms and an efficient, but approximate, probabilistic algorithm to compute the nearest neighbors of a query point with respect to this measure. Finally, we illustrate a practical application of our approach in a recommendation engine built for the Tower Records online video and DVD catalog.
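As an illustration only, the sketch below shows one possible reading of a frequency-based difference measure together with a brute-force nearest-neighbor search. The abstract does not give the formula, so the definition used here is an assumption: the difference between two records is taken to be the fraction of records in the data set that contain every category the two records share. The names `Record`, `difference`, and `nearest_neighbors`, and the toy catalog, are hypothetical.

```python
from typing import FrozenSet, List, Sequence

# A record is modeled as a set of category labels (an assumption of this sketch).
Record = FrozenSet[str]


def difference(x: Record, y: Record, data: Sequence[Record]) -> float:
    """Assumed measure: fraction of records containing all categories shared by x and y.

    A rare shared combination matches few records, so the difference is small.
    Records with nothing in common share the empty set, which every record
    contains, so the difference is 1.0.
    """
    if not data:
        raise ValueError("data must contain at least one record")
    shared = x & y
    matches = sum(1 for record in data if shared <= record)
    return matches / len(data)


def nearest_neighbors(query: Record, data: Sequence[Record], k: int) -> List[Record]:
    """Brute force: score every record against the query and keep the k smallest."""
    ranked = sorted(data, key=lambda record: difference(query, record, data))
    return ranked[:k]


# Toy catalog, invented for this example.
data = [
    frozenset({"genre:horror", "decade:1970s", "format:dvd"}),
    frozenset({"genre:horror", "decade:1970s", "format:vhs"}),
    frozenset({"genre:comedy", "decade:1990s", "format:dvd"}),
    frozenset({"genre:comedy", "decade:1980s", "format:dvd"}),
    frozenset({"genre:drama", "decade:1990s", "format:dvd"}),
]
query = frozenset({"genre:horror", "decade:1970s", "format:laserdisc"})

print(difference(query, data[0], data))  # 0.4: two of five records are 1970s horror
print(difference(query, data[2], data))  # 1.0: nothing in common with the query
print(nearest_neighbors(query, data, 2))
```

Because the score depends on how often the shared combination occurs as a whole, the dimensions are not treated independently: two records that share a rare pair of categories come out closer than two records that share a single common one. This brute-force search scans the data set once per candidate, so each query costs time quadratic in the number of records.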