Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing

Finding nearest neighbors has become an important operation on databases, with applications in text search, multimedia indexing, and many other areas. One popular algorithm for similarity search, especially for high-dimensional data (where spatial indexes like kd-trees do not perform well), is Locality-Sensitive Hashing (LSH), an approximation algorithm for finding similar objects. In this paper, we describe a new variant of LSH, called Parallel LSH (PLSH), that is designed to be extremely efficient, scales out across multiple nodes and multiple cores, and supports high-throughput streaming of new data. Our approach employs several novel ideas, including: a cache-conscious hash table layout, using a 2-level merge algorithm for hash table construction; an efficient algorithm for duplicate elimination during hash table querying; an insert-optimized hash table structure and an efficient data expiration algorithm for streaming data; and a performance model that accurately estimates the performance of the algorithm and can be used to optimize parameter settings. We show that on a workload where we perform similarity search on a dataset of more than 1 billion tweets, with hundreds of millions of new tweets per day, we can achieve query times of 1 to 2.5 ms. We show that this is an order of magnitude faster than existing indexing schemes, such as inverted indexes. To the best of our knowledge, this is the fastest implementation of LSH, with table construction times up to 3.7× faster and query times that are 8.3× faster than a basic implementation.
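
For readers unfamiliar with LSH, the following minimal sketch illustrates the basic scheme that the abstract builds on: several hash tables keyed by random-hyperplane signatures, with duplicate elimination when a candidate appears in more than one table. It is not the PLSH implementation. The class name `HyperplaneLSH`, the parameters `k` (bits per key), `L` (number of tables), and `radius`, and the use of a Python set for duplicate elimination are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def angle(u, v):
    """Angular distance in radians between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


class HyperplaneLSH:
    """Illustrative LSH index: L tables, each keyed by k random-hyperplane bits."""

    def __init__(self, dim, k=8, L=4, seed=0):
        rng = random.Random(seed)
        # One set of k random hyperplanes per table (hypothetical defaults).
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]
            for _ in range(L)
        ]
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = []

    def _key(self, planes, v):
        # Each bit records which side of a hyperplane the vector falls on.
        key = 0
        for p in planes:
            bit = int(sum(a * b for a, b in zip(p, v)) >= 0)
            key = (key << 1) | bit
        return key

    def insert(self, v):
        idx = len(self.points)
        self.points.append(v)
        for planes, table in zip(self.planes, self.tables):
            table[self._key(planes, v)].append(idx)
        return idx

    def query(self, q, radius):
        seen = set()  # duplicate elimination across the L tables
        hits = []
        for planes, table in zip(self.planes, self.tables):
            for idx in table.get(self._key(planes, q), ()):
                if idx in seen:
                    continue
                seen.add(idx)
                # Filter candidates by their true distance to the query.
                if angle(q, self.points[idx]) <= radius:
                    hits.append(idx)
        return hits
```

In this sketch a query touches only the `L` buckets that match its signatures, so its cost depends on bucket sizes rather than on the size of the whole dataset. The `seen` set ensures that each candidate's distance is computed at most once.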
