Efficient index-based KNN join processing for high-dimensional data

In many advanced database applications (e.g., multimedia databases), data objects are transformed into high-dimensional points and manipulated in high-dimensional space. One of the most important but costly operations is the similarity join, which combines similar points from multiple datasets. In this paper, we examine the problem of processing the K-nearest neighbor similarity join (KNN join). The KNN join between two datasets, R and S, returns for each point in R its K most similar points in S. We propose a new index-based KNN join approach that uses iDistance as the underlying index structure. We first present the basic algorithm and then propose two enhancements. In the first, we optimize the basic KNN join algorithm by using approximation bounding cubes. In the second, we exploit the reduced dimensions of the data space. We conducted an extensive experimental study using both synthetic and real datasets, and the results verify the performance advantage of our schemes over existing KNN join algorithms.
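To make the KNN join definition concrete, the following is a minimal sketch of a naive nested-loop baseline. It is not the index-based approach proposed in the paper and does not use iDistance. The function name `knn_join`, the use of Euclidean distance as the similarity measure, and the sample points are assumptions chosen for illustration only.

```python
import heapq
import math


def knn_join(R, S, k):
    """Naive KNN join: pair each point r in R with its k nearest points in S.

    Assumes Euclidean distance; this is a brute-force reference that
    scans all of S for every r, not an index-based method.
    """
    result = []
    for r in R:
        # nsmallest returns the k points of S closest to r, nearest first.
        neighbors = heapq.nsmallest(k, S, key=lambda s: math.dist(r, s))
        result.append((r, neighbors))
    return result


# Hypothetical two-dimensional sample data.
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(1.0, 0.0), (0.0, 2.0), (4.0, 5.0), (9.0, 9.0)]

pairs = knn_join(R, S, 2)
for r, neighbors in pairs:
    print(r, "->", neighbors)
```

Note that the operation is asymmetric: joining R with S generally gives a different result from joining S with R, because the neighbors are always drawn from the second dataset. The baseline computes |R| x |S| distances, which is the cost an index-based scheme aims to reduce.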
