LoSHa: A General Framework for Scalable Locality Sensitive Hashing

Locality Sensitive Hashing (LSH) algorithms are widely adopted to index similar items in high dimensional space for approximate nearest neighbor search. As the volume of real-world datasets keeps growing, it has become necessary to develop distributed LSH solutions. Implementing a distributed LSH algorithm from scratch requires high development costs, thus most existing solutions are developed on general-purpose platforms such as Hadoop and Spark. However, we argue that these platforms are both hard to use for programming LSH algorithms and inefficient for LSH computation. We propose LoSHa, a distributed computing framework that reduces the development cost by designing a tailor-made, general programming interface and achieves high efficiency by exploring LSH-specific system implementation and optimizations. We show that many LSH algorithms can be easily expressed in LoSHa's API. We evaluate LoSHa and also compare with general-purpose platforms on the same LSH algorithms. Our results show that LoSHa's performance can be an order of magnitude faster, while the implementations on LoSHa are even more intuitive and require few lines of code.

[1]  Qiang Huang,et al.  Query-Aware Locality-Sensitive Hashing for Approximate Nearest Neighbor Search , 2015, Proc. VLDB Endow..

[2]  Abhinandan Das,et al.  Google news personalization: scalable online collaborative filtering , 2007, WWW '07.

[3]  Laurent Amsaleg,et al.  Locality sensitive hashing: A comparison of hash function types and querying mechanisms , 2010, Pattern Recognit. Lett..

[4]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[5]  Nicole Immorlica,et al.  Locality-sensitive hashing scheme based on p-stable distributions , 2004, SCG '04.

[6]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[7]  Jeffrey D. Ullman,et al.  Mining of Massive Datasets: Data Mining , 2011 .

[8]  Aart J. C. Bik,et al.  Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[9]  Fan Yang,et al.  A comparison of general-purpose distributed systems for data processing , 2016, 2016 IEEE International Conference on Big Data (Big Data).

[10]  Joseph Gonzalez,et al.  PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs , 2012, OSDI.

[11]  Sanjay Ghemawat,et al.  MapReduce: simplified data processing on large clusters , 2008, CACM.

[12]  Din J. Wasem,et al.  Mining of Massive Datasets , 2014 .

[13]  Alan M. Frieze,et al.  Min-wise independent permutations (extended abstract) , 1998, STOC '98.

[14]  Anthony K. H. Tung,et al.  LazyLSH: Approximate Nearest Neighbor Search for Multiple Distance Functions with a Single Index , 2016, SIGMOD Conference.

[15]  Heng Tao Shen,et al.  Hashing for Similarity Search: A Survey , 2014, ArXiv.

[16]  Ashish Goel,et al.  Efficient distributed locality sensitive hashing , 2012, CIKM.

[17]  Beng Chin Ooi,et al.  DSH: data sensitive hashing for high-dimensional k-nnsearch , 2014, SIGMOD Conference.

[18]  Zhe Wang,et al.  Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search , 2007, VLDB.

[19]  Seunghak Lee,et al.  On Model Parallelization and Scheduling Strategies for Distributed Machine Learning , 2014, NIPS.

[20]  Fan Yang,et al.  LFTF: A Framework for Efficient Tensor Analytics at Scale , 2017, Proc. VLDB Endow..

[21]  Fan Yang,et al.  The Best of Both Worlds: Big Data Programming with Both Productivity and Performance , 2017, SIGMOD Conference.

[22]  Pradeep Dubey,et al.  Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing , 2013, Proc. VLDB Endow..

[23]  Zi Huang,et al.  SK-LSH: An Efficient Index Structure for Approximate Nearest Neighbor Search , 2014, Proc. VLDB Endow..

[24]  Yi Lu,et al.  Large-Scale Distributed Graph Computing Systems: An Experimental Evaluation , 2014, Proc. VLDB Endow..

[25]  Fan Yang,et al.  Husky: Towards a More Efficient and Expressive Distributed Computing Framework , 2016, Proc. VLDB Endow..

[26]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[27]  Moses Charikar,et al.  Similarity estimation techniques from rounding algorithms , 2002, STOC '02.

[28]  Xuemin Lin,et al.  SRS: Solving c-Approximate Nearest Neighbor Queries in High Dimensional Euclidean Space with a Tiny Index , 2014, Proc. VLDB Endow..

[29]  Alexander J. Smola,et al.  Scaling Distributed Machine Learning with the Parameter Server , 2014, OSDI.

[30]  Karl Aberer,et al.  Distributed similarity search in high dimensions using locality sensitive hashing , 2009, EDBT '09.

[31]  Wilfred Ng,et al.  Locality-sensitive hashing scheme based on dynamic collision counting , 2012, SIGMOD Conference.

[32]  Panos Kalnis,et al.  Quality and efficiency in high dimensional nearest neighbor search , 2009, SIGMOD Conference.

[33]  Wei Liu,et al.  Hashing with Graphs , 2011, ICML.