The nearest-neighbor problem arises in clustering and many other applications. It requires us to define a function that measures differences among items in a data set, and then to compute the items closest to a query point with respect to this measure. Recent work suggests that the conventional Euclidean measure does not adequately model high-dimensional data. We present a new, data-driven difference measure for categorical data in which the difference between two data points is based on the frequency of the categories, or combinations of categories, that they have in common. This measure addresses the main flaw of the Euclidean distance measure: that it treats each dimension independently. We then provide both brute-force algorithms and an efficient, but approximate, probabilistic algorithm for computing the nearest neighbors of a query point with respect to this measure. Finally, we illustrate a practical application of our approach in a recommendation engine built for the Tower Records online video and DVD catalog.
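The abstract does not give the exact formula, but the core idea of a frequency-based categorical difference can be sketched as follows. This is a hypothetical illustration, not the paper's actual measure: here, dimensions on which two points agree contribute a cost equal to the relative frequency of the shared value (so sharing a rare value makes points closer), and dimensions on which they disagree contribute the maximum cost of 1. The function names and the toy data set are invented for the example.

```python
from collections import Counter

def fit_frequencies(data):
    """Count how often each (dimension, value) pair occurs in the data set."""
    counts = Counter()
    for row in data:
        for dim, value in enumerate(row):
            counts[(dim, value)] += 1
    return counts, len(data)

def dissimilarity(x, y, counts, n):
    """Frequency-based difference between two categorical points.

    Matching dimensions cost the relative frequency of the shared value
    (rare shared values -> smaller cost); mismatches cost 1.
    """
    total = 0.0
    for dim, (a, b) in enumerate(zip(x, y)):
        if a == b:
            total += counts[(dim, a)] / n  # common shared value -> larger cost
        else:
            total += 1.0                   # disagreement -> full cost
    return total

# Toy data set (invented): (color, body style) of four items.
data = [("red", "suv"), ("red", "sedan"), ("blue", "suv"), ("green", "coupe")]
counts, n = fit_frequencies(data)

# Identical points sharing the rare values ("green", "coupe") score lower
# than identical points sharing the common values ("red", "suv").
d_common = dissimilarity(("red", "suv"), ("red", "suv"), counts, n)
d_rare = dissimilarity(("green", "coupe"), ("green", "coupe"), counts, n)
```

Under this sketch, `d_rare < d_common`, which captures the intuition that agreement on infrequent categories is more informative than agreement on frequent ones; the paper's actual measure also accounts for combinations of categories, which this single-dimension version omits.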