SES-LSH: Shuffle-Efficient Locality Sensitive Hashing for Distributed Similarity Search

Locality Sensitive Hashing (LSH) is a widely used similarity search technique behind many web services, such as content-based retrieval of images and videos. Owing to its popularity, much research effort has been devoted to improving the search quality and the indexing and query performance of LSH. However, most existing LSH variants can only run on a single node, which limits their applicability to large-scale data. In this paper, we present SES-LSH, a Shuffle-Efficient Similarity Search scheme based on LSH that can be executed efficiently in distributed environments to serve massive amounts of data. SES-LSH comprises a shuffle-efficient indexing scheme, which reduces data shuffling when constructing hash tables, and a location-aware querying scheme, which improves query performance. We have implemented a prototype of SES-LSH on Spark, with several optimizations that improve the fine-grained hash table operations of distributed LSH. Extensive experiments on large-scale real-world datasets show that SES-LSH is remarkably more efficient than existing methods.
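
For readers unfamiliar with the hash table construction and querying that the abstract refers to, the following is a minimal single-process sketch of generic LSH using random-hyperplane (sign) hashing. It is not the SES-LSH scheme itself: the function names, parameters, and the choice of hash family are assumptions made purely for illustration, and nothing here is distributed. In a distributed setting, grouping points by bucket key (the `table[key].append(idx)` step) is presumably where data must move between nodes, which is the shuffle cost the paper aims to reduce.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_tables, bits_per_key, dim, seed=0):
    """Draw random Gaussian hyperplanes: one group of `bits_per_key` per table.

    All names and parameters are hypothetical, chosen for this sketch only.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits_per_key)]
        for _ in range(num_tables)
    ]


def bucket_key(planes, vec):
    """Hash a vector to a tuple of bits: the sign of its dot product with each plane."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def build_index(tables_planes, data):
    """Build one hash table per hyperplane group, mapping bucket key -> point ids."""
    tables = [defaultdict(list) for _ in tables_planes]
    for idx, vec in enumerate(data):
        for table, planes in zip(tables, tables_planes):
            table[bucket_key(planes, vec)].append(idx)
    return tables


def query(tables_planes, tables, vec):
    """Return ids of all points sharing a bucket with `vec` in at least one table."""
    candidates = set()
    for table, planes in zip(tables, tables_planes):
        candidates.update(table.get(bucket_key(planes, vec), []))
    return candidates


if __name__ == "__main__":
    data = [[1.0, 0.0, 0.5], [-1.0, 0.2, -0.5], [0.9, 0.1, 0.4]]
    planes = make_hyperplanes(num_tables=4, bits_per_key=6, dim=3, seed=1)
    tables = build_index(planes, data)
    # Candidate ids for a query vector; candidates would normally be re-ranked
    # by exact distance before returning the nearest neighbors.
    print(sorted(query(planes, tables, [1.0, 0.0, 0.5])))
```

Using several tables raises the chance that true neighbors collide with the query in at least one of them, while longer keys make each bucket more selective; a real system would follow the candidate lookup with an exact distance check.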
