Scalable Locality-Sensitive Hashing for Similarity Search in High-Dimensional, Large-Scale Multimedia Datasets

Similarity search is critical for many database applications, including the increasingly popular online services for Content-Based Multimedia Retrieval (CBMR). These services, which include image search engines, must handle an overwhelming volume of data while keeping response times low. Scalability is therefore imperative for similarity search in Web-scale applications, yet most existing methods are sequential and target shared-memory machines. Here we address these issues with a distributed, efficient, and scalable index based on Locality-Sensitive Hashing (LSH). LSH is one of the most efficient and popular techniques for similarity search, but its poor referential locality has made it challenging to implement efficiently. Our solution is based on a widely asynchronous dataflow parallelization with a number of optimizations: a hierarchical parallelization that decouples indexing from data storage, locality-aware data partition strategies that reduce message passing, and multi-probing that limits memory usage. The proposed parallelization attained an efficiency of 90% in a distributed system with about 800 CPU cores. In particular, the original locality-aware data partition reduced the number of messages exchanged by 30%. Our parallel LSH was evaluated using the largest public dataset for similarity search (to the best of our knowledge), with $10^9$ 128-d SIFT descriptors extracted from Web images. This is two orders of magnitude larger than the datasets that previous LSH parallelizations could handle.
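To make the underlying idea concrete, the following is a minimal, sequential sketch of an LSH index for Euclidean distance. It is not the distributed implementation described above. It assumes the commonly used scheme in which each hash function quantizes a random Gaussian projection, h(v) = floor((a·v + b) / w). The class name `PStableLSH` and all parameter names and default values are hypothetical, chosen only for illustration.

```python
import math
import random
from collections import defaultdict


class PStableLSH:
    """Toy Euclidean LSH index: several hash tables, each keyed by quantized projections.

    Illustrative sketch only; names and defaults are assumptions, not the paper's.
    """

    def __init__(self, dim, num_tables=4, hashes_per_table=3, bucket_width=4.0, seed=0):
        rng = random.Random(seed)
        self.w = bucket_width
        # One (a, b) pair per hash function: a ~ N(0, 1)^dim, b ~ U[0, w).
        self.funcs = [
            [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, bucket_width))
             for _ in range(hashes_per_table)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.points = []

    def _key(self, table_id, v):
        # Bucket key of v in one table: the tuple of its quantized projections.
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.w)
            for a, b in self.funcs[table_id]
        )

    def add(self, v):
        # Store the point and register its id in one bucket per table.
        idx = len(self.points)
        self.points.append(list(v))
        for t, table in enumerate(self.tables):
            table[self._key(t, v)].append(idx)
        return idx

    def query(self, q, k=1):
        # Gather candidates from the query's bucket in every table,
        # then rank only those candidates by exact Euclidean distance.
        candidates = set()
        for t, table in enumerate(self.tables):
            candidates.update(table.get(self._key(t, q), []))
        ranked = sorted(candidates, key=lambda i: math.dist(self.points[i], q))
        return ranked[:k]
```

In this sketch each query touches one bucket per table, so recall is usually raised by adding tables, at the cost of memory. Multi-probing instead visits several nearby buckets in each table, which is why it is listed above as a way to limit memory usage.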
