Fast Approximated Nearest Neighbor Joins For Relational Database Systems

K-nearest-neighbor search (kNN-Search) is a universal data processing technique and a fundamental operation on word embeddings trained by word2vec or related approaches. The benefits that operations on dense vectors such as word embeddings bring to the analytical functionality of RDBMSs motivate the integration of kNN-Joins. However, kNN-Search and kNN-Joins have barely been integrated into relational database systems so far. In this paper, we develop an index structure for approximated kNN-Joins that works well on high-dimensional data, and we integrate it into PostgreSQL. The novel index structure is efficient for different cardinalities of the involved join partners. An evaluation of the system on word-embedding applications shows the benefits of such an integrated kNN-Join operation and the performance of the proposed approach.
