High-dimensional similarity searches using query driven dynamic quantization and distributed indexing

The concept of similarity underlies many data exploration and data mining tasks. Nearest neighbor (NN) queries identify the items most similar to a query point or, in terms of distance, the points closest to it. Similarity is traditionally characterized using a distance function between multi-dimensional feature vectors. However, when the data is high-dimensional, traditional distance functions fail to distinguish significantly between the closest and furthest points, because a few dissimilar dimensions dominate the distance. Localized similarity functions, i.e., functions that only consider dimensions close to the query, quantize each dimension independently and only compute similarity for the dimensions where the query and the point fall into the same bin. These quantizations are query-agnostic, however, and a query-dependent quantization has the potential to improve accuracy. In this work we propose a query-dependent equi-depth (QED) on-the-fly quantization method to improve high-dimensional similarity searches. The quantization is done for each dimension at query time: localized scores are generated for the closest p fraction of the points, while a constant penalty is applied to the rest. QED not only improves the quality of the distance metric, but also improves query time by filtering out non-relevant data. We propose a distributed indexing and query algorithm to compute QED efficiently. Our experimental results show improved classification accuracy, as well as query times up to one order of magnitude faster than Manhattan-based sequential-scan NN queries, over datasets with hundreds of dimensions. Furthermore, similarity searches with QED scale linearly or better with the number of dimensions and with the number of compute nodes.
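
The per-dimension quantization described above can be illustrated with a minimal single-machine sketch. It is not the distributed algorithm proposed in this work, and several details are assumptions made only for illustration: the function names `qed_distances` and `qed_knn` are hypothetical, the localized score is taken to be the absolute difference to the query, features are assumed to be normalized to [0, 1], and the constant `penalty` is supplied by the caller.

```python
import math


def qed_distances(data, query, p, penalty):
    """Illustrative query-dependent equi-depth scoring (assumed details).

    For every dimension, the ceil(p * n) points closest to the query form the
    query's bin. Points in the bin contribute their absolute difference to the
    query; all other points contribute the constant `penalty`.
    """
    n = len(data)
    keep = max(1, math.ceil(p * n))  # bin size: same depth in every dimension
    totals = [0.0] * n
    for j, q in enumerate(query):
        diffs = [abs(row[j] - q) for row in data]
        # Indices of the `keep` points nearest to the query in dimension j.
        in_bin = set(sorted(range(n), key=diffs.__getitem__)[:keep])
        for i in range(n):
            totals[i] += diffs[i] if i in in_bin else penalty
    return totals


def qed_knn(data, query, k, p, penalty=1.0):
    """Return the indices of the k points with the smallest summed score."""
    dist = qed_distances(data, query, p, penalty)
    return sorted(range(len(data)), key=dist.__getitem__)[:k]


if __name__ == "__main__":
    points = [[0.0, 0.0], [0.1, 0.1], [0.9, 0.0], [0.0, 0.9]]
    print(qed_knn(points, [0.0, 0.0], k=2, p=0.75))
```

Because the bin is rebuilt around each query, it always holds the same fraction p of the points in every dimension, which is what makes the quantization equi-depth and query-dependent. This sketch sorts every dimension at query time; the indexing scheme proposed in the work is meant to avoid that cost.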
