Distributed query-aware quantization for high-dimensional similarity searches

Similarity underlies many data exploration and data mining tasks. Nearest Neighbor (NN) queries identify the items most similar to a query point or, in terms of distance, the points closest to it. Similarity is traditionally characterized using a distance function between multi-dimensional feature vectors. However, when the data is high-dimensional, traditional distance functions fail to distinguish significantly between the closest and furthest points, because a few dissimilar dimensions dominate the distance function. Localized similarity functions, i.e. functions that only consider dimensions close to the query, quantize each dimension independently and compute similarity only for the dimensions where the query and a point fall into the same bin. These quantizations are query-agnostic, and a query-dependent quantization has the potential to improve accuracy. In this paper we propose Query-dependent Equi-Depth (QED), an on-the-fly quantization method that improves high-dimensional similarity searches. The quantization is done for each dimension at query time: localized scores are generated for the closest p fraction of the points, and a constant penalty is applied to the remaining points. QED not only improves the quality of the distance metric but also improves query-time performance by filtering out non-relevant data. We propose a distributed indexing and query algorithm to compute QED efficiently. Our experimental results show improvements in classification accuracy, as well as query performance up to one order of magnitude faster than Manhattan-based sequential-scan NN queries over datasets with hundreds of dimensions.
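The scoring idea described above can be illustrated with a minimal, single-machine sketch. It is not the paper's distributed index-based algorithm. The function names (`qed_scores`, `qed_knn`), the use of NumPy, the normalization of in-bin distances by the bin radius, and the default penalty of 1.0 are all assumptions made for illustration. For each dimension, the sketch keeps the closest p fraction of points (an equi-depth bin centered on the query), gives them a localized score, and charges every other point a constant penalty.

```python
import numpy as np

def qed_scores(data, query, p=0.25, penalty=1.0):
    """Illustrative query-dependent equi-depth score (lower = more similar).

    Assumptions (not taken from the paper): in-bin distances are scaled by
    the bin radius so they lie in [0, penalty]; out-of-bin points receive
    exactly `penalty` for that dimension.
    """
    data = np.asarray(data, dtype=float)
    query = np.asarray(query, dtype=float)
    n, _ = data.shape
    m = max(1, int(np.ceil(p * n)))          # points kept per dimension

    diffs = np.abs(data - query)             # per-dimension distance to the query
    # Distance of the m-th closest point in each dimension = bin radius.
    radius = np.partition(diffs, m - 1, axis=0)[m - 1]

    inside = diffs <= radius                 # ties at the boundary are kept
    safe_radius = np.where(radius > 0, radius, 1.0)   # avoid division by zero
    local = penalty * diffs / safe_radius    # localized score inside the bin

    per_dim = np.where(inside, local, penalty)
    return per_dim.sum(axis=1)

def qed_knn(data, query, k, p=0.25, penalty=1.0):
    """Indices of the k points with the lowest illustrative QED score."""
    scores = qed_scores(data, query, p=p, penalty=penalty)
    return np.argsort(scores, kind="stable")[:k]
```

Calling `qed_knn(data, query, k=5, p=0.25)` on an `(n, d)` array returns the indices of the five best-scoring rows. Because every out-of-bin dimension contributes the same constant, a single very dissimilar dimension cannot dominate the total in this sketch, which is the behavior the localized, penalized scoring is meant to achieve.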
