Locality-Sensitive Hashing Scheme based on Longest Circular Co-Substring

Locality-Sensitive Hashing (LSH) is one of the most popular methods for c-Approximate Nearest Neighbor Search (c-ANNS) in high-dimensional spaces. In this paper, we propose a novel LSH scheme based on the Longest Circular Co-Substring (LCCS) search framework (LCCS-LSH) with a theoretical guarantee. We introduce a novel concept of LCCS and a new data structure named Circular Shift Array (CSA) for k-LCCS search. The insight of LCCS search framework is that close data objects will have a longer LCCS than the far-apart ones with high probability. LCCS-LSH is LSH-family-independent, and it supports c-ANNS with different kinds of distance metrics. We also introduce a multi-probe version of LCCS-LSH and conduct extensive experiments over five real-life datasets. The experimental results demonstrate that LCCS-LSH outperforms state-of-the-art LSH schemes.

[1]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[2]  Jon Louis Bentley,et al.  K-d trees for semidynamic point sets , 1990, SCG '90.

[3]  Alexandr Andoni,et al.  Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions , 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[4]  Rina Panigrahy,et al.  Entropy based nearest neighbor search in high dimensions , 2005, SODA '06.

[5]  Wilfred Ng,et al.  Locality-sensitive hashing scheme based on dynamic collision counting , 2012, SIGMOD Conference.

[6]  Xuemin Lin,et al.  SRS: Solving c-Approximate Nearest Neighbor Queries in High Dimensional Euclidean Space with a Tiny Index , 2014, Proc. VLDB Endow..

[7]  Yuzuru Tanaka,et al.  Spherical LSH for Approximate Nearest Neighbor Search on Unit Hypersphere , 2007, WADS.

[8]  Anshumali Shrivastava,et al.  Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search , 2017, SIGMOD Conference.

[9]  Shin'ichi Satoh,et al.  The SR-tree: an index structure for high-dimensional nearest neighbor queries , 1997, SIGMOD '97.

[10]  Jon M. Kleinberg,et al.  Two algorithms for nearest-neighbor search in high dimensions , 1997, STOC '97.

[11]  Panos Kalnis,et al.  Quality and efficiency in high dimensional nearest neighbor search , 2009, SIGMOD Conference.

[12]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[13]  Anthony K. H. Tung,et al.  A Generic Inverted Index Framework for Similarity Search on the GPU , 2016, 2018 IEEE 34th International Conference on Data Engineering (ICDE).

[14]  Beng Chin Ooi,et al.  iDistance: An adaptive B+-tree based indexing method for nearest neighbor search , 2005, TODS.

[15]  Deng Cai,et al.  Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph , 2017, Proc. VLDB Endow..

[16]  Michael S. Waterman,et al.  An extreme value theory for long head runs , 1986 .

[17]  Zhe Wang,et al.  Modeling LSH for performance tuning , 2008, CIKM '08.

[18]  Zi Huang,et al.  SK-LSH: An Efficient Index Structure for Approximate Nearest Neighbor Search , 2014, Proc. VLDB Endow..

[19]  KatayamaNorio,et al.  The SR-tree , 1997 .

[20]  Qiang Huang,et al.  Query-aware locality-sensitive hashing scheme for $$l_p$$lp norm , 2017, The VLDB Journal.

[21]  Piotr Indyk,et al.  Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality , 2012, Theory Comput..

[22]  Qiang Huang,et al.  Query-Aware Locality-Sensitive Hashing for Approximate Nearest Neighbor Search , 2015, Proc. VLDB Endow..

[23]  John Langford,et al.  Cover trees for nearest neighbor , 2006, ICML.

[24]  Yury A. Malkov,et al.  Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Eugene W. Myers,et al.  Suffix arrays: a new method for on-line string searches , 1993, SODA '90.

[26]  Alexandr Andoni,et al.  Practical and Optimal LSH for Angular Distance , 2015, NIPS.

[27]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[28]  Moses Charikar,et al.  Similarity estimation techniques from rounding algorithms , 2002, STOC '02.

[29]  Anthony K. H. Tung,et al.  LazyLSH: Approximate Nearest Neighbor Search for Multiple Distance Functions with a Single Index , 2016, SIGMOD Conference.

[30]  Zhe Wang,et al.  Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search , 2007, VLDB.

[31]  Andrei Z. Broder,et al.  On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[32]  Alan M. Frieze,et al.  Min-wise independent permutations (extended abstract) , 1998, STOC '98.

[33]  Alexandr Andoni,et al.  Optimal Data-Dependent Hashing for Approximate Near Neighbors , 2015, STOC.

[34]  Cordelia Schmid,et al.  Product Quantization for Nearest Neighbor Search , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Alan M. Frieze,et al.  Min-Wise Independent Permutations , 2000, J. Comput. Syst. Sci..

[36]  Mayank Bawa,et al.  LSH forest: self-tuning indexes for similarity search , 2005, WWW '05.

[37]  Ronald Fagin,et al.  Efficient similarity search and classification via rank aggregation , 2003, SIGMOD '03.

[38]  Nicole Immorlica,et al.  Locality-sensitive hashing scheme based on p-stable distributions , 2004, SCG '04.

[39]  Hans-Jörg Schek,et al.  A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces , 1998, VLDB.

[40]  Anthony K. H. Tung,et al.  Sublinear Time Nearest Neighbor Search over Generalized Weighted Space , 2019, ICML.