K nearest neighbor queries and kNN-Joins in large relational databases (almost) for free

Finding the k nearest neighbors (kNN) of a query point, or a set of query points (kNN-Join) are fundamental problems in many application domains. Many previous efforts to solve these problems focused on spatial databases or stand-alone systems, where changes to the database engine may be required, which may limit their application on large data sets that are stored in a relational database management system. Furthermore, these methods may not automatically optimize kNN queries or kNN-Joins when additional query conditions are specified. In this work, we study both the kNN query and the kNN-Join in a relational database, possibly augmented with additional query conditions. We search for relational algorithms that require no changes to the database engine. The straightforward solution uses the user-defined-function (UDF) that a query optimizer cannot optimize.We design algorithms that could be implemented by SQL operators without changes to the database engine, hence enabling the query optimizer to understand and generate the “best” query plan. Using only a small constant number of random shifts for databases in any fixed dimension, our approach guarantees to find the approximate kNN with only logarithmic number of page accesses in expectation with a constant approximation ratio and it could be extended to find the exact kNN efficiently in any fixed dimension. Our design paradigm easily supports the kNN-Join and updates. Extensive experiments on large, real and synthetic, data sets confirm the efficiency and practicality of our approach.

[1]  Hanan Samet,et al.  Incremental distance join algorithms for spatial databases , 1998, SIGMOD '98.

[2]  Mario A. López,et al.  High dimensional similarity search with space filling curves , 2001, Proceedings 17th International Conference on Data Engineering.

[3]  Surajit Chaudhuri,et al.  Towards a robust query optimizer: a principled and practical approach , 2005, SIGMOD '05.

[4]  Sunil Arya,et al.  An optimal algorithm for approximate nearest neighbor searching fixed dimensions , 1998, JACM.

[5]  Volker Markl,et al.  Integrating the UB-Tree into a Database System Kernel , 2000, VLDB.

[6]  Hanan Samet,et al.  A Fast Similarity Join Algorithm Using Graphics Processing Units , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[7]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[8]  Jianwen Su,et al.  Efficient index-based KNN join processing for high-dimensional data , 2007, Inf. Softw. Technol..

[9]  Timothy M. Chan Closest-point problems simplified on the RAM , 2002, SODA '02.

[10]  Beng Chin Ooi,et al.  Gorder: An Efficient Method for KNN Join Processing , 2004, VLDB.

[11]  Anteneh Ayanso,et al.  Efficient processing of k-nearest neighbor queries over relational databases: a cost-based optimization , 2004 .

[12]  Yannis Manolopoulos,et al.  Closest pair queries in spatial databases , 2000, SIGMOD '00.

[13]  Nick Roussopoulos,et al.  Nearest neighbor queries , 1995, SIGMOD '95.

[14]  Dimitris Papadias,et al.  Multiway spatial joins , 2001, ACM Trans. Database Syst..

[15]  Luis Gravano,et al.  Approximate String Joins in a Database (Almost) for Free , 2001, VLDB.

[16]  Hanan Samet,et al.  Spatial join techniques , 2007, TODS.

[17]  Hans-Peter Kriegel,et al.  The R*-tree: an efficient and robust access method for points and rectangles , 1990, SIGMOD '90.

[18]  Panos Kalnis,et al.  Quality and efficiency in high dimensional nearest neighbor search , 2009, SIGMOD Conference.

[19]  Mayank Bawa,et al.  LSH forest: self-tuning indexes for similarity search , 2005, WWW '05.

[20]  Hanan Samet,et al.  Distance browsing in spatial databases , 1999, TODS.

[21]  Beng Chin Ooi,et al.  iDistance: An adaptive B+-tree based indexing method for nearest neighbor search , 2005, TODS.

[22]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[23]  Dražen Odobašić Open Street Map , 2009 .

[24]  Timothy M. Chan Approximate Nearest Neighbor Queries Revisited , 1997, SCG '97.

[25]  Zhe Wang,et al.  Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search , 2007, VLDB.

[26]  Christian Böhm,et al.  High performance data mining using the nearest neighbor join , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[27]  Kothuri Venkata Ravi Kanth,et al.  Quadtree and R-tree indexes in oracle spatial: a comparison using GIS data , 2002, SIGMOD '02.

[28]  Christian Böhm,et al.  The k-Nearest Neighbour Join: Turbo Charging the KDD Process , 2004, Knowledge and Information Systems.

[29]  Cyrus Shahabi,et al.  Voronoi-Based K Nearest Neighbor Search for Spatial Network Databases , 2004, VLDB.

[30]  A. Guttmma,et al.  R-trees: a dynamic index structure for spatial searching , 1984 .

[31]  David R. Karger,et al.  Finding nearest neighbors in growth-restricted metrics , 2002, STOC '02.

[32]  SuJianwen,et al.  Efficient index-based KNN join processing for high-dimensional data , 2007 .

[33]  Yufei Tao,et al.  Nearest Neighbor Queries in Network Databases , 2017, Encyclopedia of GIS.

[34]  Ronald Fagin,et al.  Efficient similarity search and classification via rank aggregation , 2003, SIGMOD '03.