Paper Information - Large-Scale Distributed Locality-Sensitive Hashing for General Metric Data

Locality-Sensitive Hashing (LSH) is extremely competitive for similarity search, but it assumes uniform access cost to the data and is available only for the handful of dissimilarities for which locality-sensitive families are known. In this work we propose Parallel Voronoi LSH, an approach that addresses both limitations: it makes LSH efficient on distributed-memory architectures, and it works for very general dissimilarities (in particular, for all metric dissimilarities). Each hash table of Voronoi LSH is built by selecting a sample of the dataset to serve as the seeds of a Voronoi diagram; the Voronoi cells are then used to hash the data. Because Voronoi diagrams depend only on the distance, the technique is very general. Implementing LSH on distributed-memory systems is challenging because its access to the data lacks referential locality: if care is not taken, excessive message passing ruins the index performance. Another important contribution of this work is therefore the parallel design needed to make the index scalable, which we evaluate on a dataset of one billion multimedia features.
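
The hashing step described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the function names (`build_voronoi_table`, `nearest_seed`, `query`), the uniform random seed sampling, the number of tables, and the toy 2-D Euclidean data are all assumptions made for illustration, and the distributed design is not covered. Each table samples seeds from the dataset and hashes every item to the index of its nearest seed, using only the distance function.

```python
import math
import random
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Table = Tuple[List[T], Dict[int, List[int]]]  # (seeds, cell id -> item indices)


def nearest_seed(item: T, seeds: Sequence[T], distance: Callable[[T, T], float]) -> int:
    """Return the index of the seed closest to item, i.e. its Voronoi cell."""
    return min(range(len(seeds)), key=lambda i: distance(item, seeds[i]))


def build_voronoi_table(data: Sequence[T], distance: Callable[[T, T], float],
                        num_seeds: int, rng: random.Random) -> Table:
    """Build one hash table: sample seeds, then bucket every item by its cell."""
    # Assumption: seeds are drawn uniformly at random from the dataset.
    seeds = rng.sample(list(data), num_seeds)
    buckets: Dict[int, List[int]] = {}
    for idx, item in enumerate(data):
        buckets.setdefault(nearest_seed(item, seeds, distance), []).append(idx)
    return seeds, buckets


def query(q: T, data: Sequence[T], tables: Sequence[Table],
          distance: Callable[[T, T], float], k: int) -> List[int]:
    """Collect candidates from q's cell in every table, then rank them by distance."""
    candidates = set()
    for seeds, buckets in tables:
        candidates.update(buckets.get(nearest_seed(q, seeds, distance), []))
    return sorted(candidates, key=lambda i: distance(q, data[i]))[:k]


# Toy data: 200 random 2-D points under the Euclidean metric (illustrative only).
rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(200)]
tables = [build_voronoi_table(data, math.dist, num_seeds=10, rng=rng) for _ in range(4)]
neighbours = query(data[7], data, tables, math.dist, k=3)
```

Because the code only ever calls `distance`, any metric can be substituted for `math.dist` without touching the indexing logic, which is the generality the abstract attributes to Voronoi cells. Using several tables with independently sampled seeds raises the chance that a true neighbour shares a cell with the query in at least one of them.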
