Distributed KNN-graph approximation via hashing

Efficiently constructing the K-Nearest Neighbor Graph (K-NNG) of large and high dimensional datasets is crucial for many applications with feature-rich objects, such as images or other multimedia content. In this paper we investigate the use of high dimensional hashing methods for efficiently approximating the K-NNG, notably in distributed environments. We first discuss the importance of balancing issues on the performance of such approaches and show why the baseline approach using Locality Sensitive Hashing does not perform well. Our new KNN-join method is based on RMMH, a recently introduced hash function family based on randomly trained classifiers. We show that the resulting hash tables are much more balanced and that the number of resulting collisions can be greatly reduced without degrading quality. We further improve the load balancing of our distributed approach by designing a parallelized local join algorithm, implemented within the MapReduce framework.

[1]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[2]  Cordelia Schmid,et al.  Product Quantization for Nearest Neighbor Search , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Yan Ke,et al.  An efficient parts-based near-duplicate and sub-image retrieval system , 2004, MULTIMEDIA '04.

[4]  Roberto J. Bayardo,et al.  Scaling up all pairs similarity search , 2007, WWW '07.

[5]  Stephen Lin,et al.  Graph Embedding and Extensions: A General Framework for Dimensionality Reduction , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Mehran Sahami,et al.  A web-based kernel function for measuring the similarity of short text snippets , 2006, WWW '06.

[7]  Andrew Zisserman,et al.  Near Duplicate Image Detection: min-Hash and tf-idf Weighting , 2008, BMVC.

[8]  Moses Charikar,et al.  Similarity estimation techniques from rounding algorithms , 2002, STOC '02.

[9]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[10]  Pavel Zezula,et al.  Approximate similarity retrieval with M-trees , 1998, The VLDB Journal.

[11]  Gangshan Wu,et al.  Frequency Based Locality Sensitive Hashing , 2011, 2011 International Conference on Multimedia Technology.

[12]  Laurent Amsaleg,et al.  Locality sensitive hashing: A comparison of hash function types and querying mechanisms , 2010, Pattern Recognit. Lett..

[13]  Kai Li,et al.  Efficient k-nearest neighbor graph construction for generic similarity measures , 2011, WWW.

[14]  Yan Ke,et al.  Efficient Near-duplicate Detection and Sub-image Retrieval , 2004 .

[15]  Richard I. Hartley,et al.  Optimised KD-trees for fast image descriptor matching , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Michael A. Casey,et al.  Song Intersection by Approximate Nearest Neighbor Search , 2006, ISMIR.

[17]  Svetlana Lazebnik,et al.  Locality-sensitive binary codes from shift-invariant kernels , 2009, NIPS.

[18]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[19]  Gediminas Adomavicius,et al.  Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions , 2005, IEEE Transactions on Knowledge and Data Engineering.

[20]  Eli Shechtman,et al.  In defense of Nearest-Neighbor based image classification , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Trevor Darrell,et al.  Nearest-Neighbor Methods in Learning and Vision: Theory and Practice (Neural Information Processing) , 2006 .

[22]  Martial Hebert,et al.  Rapid object indexing using locality sensitive hashing and joint 3D-signature space estimation , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Pavel Zezula,et al.  M-tree: An Efficient Access Method for Similarity Search in Metric Spaces , 1997, VLDB.

[24]  Theoni Pitoura,et al.  Replication, Load Balancing and Efficient Range Query Processing in DHTs , 2006, EDBT.

[25]  David G. Lowe,et al.  Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration , 2009, VISAPP.

[26]  Nicole Immorlica,et al.  Locality-sensitive hashing scheme based on p-stable distributions , 2004, SCG '04.

[27]  Hans-Jörg Schek,et al.  A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces , 1998, VLDB.

[28]  Olivier Buisson,et al.  Random maximum margin hashing , 2011, CVPR 2011.

[29]  Nozha Boujemaa,et al.  Large scale visual-based event matching , 2011, ICMR.

[30]  John Langford,et al.  Cover trees for nearest neighbor , 2006, ICML.

[31]  Shin'ichi Satoh,et al.  The SR-tree: an index structure for high-dimensional nearest neighbor queries , 1997, SIGMOD '97.

[32]  Kristen Grauman,et al.  Kernelized locality-sensitive hashing for scalable image search , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[33]  Karl Aberer,et al.  LSH At Large - Distributed KNN Search in High Dimensions , 2008, WebDB.

[34]  M. R. Brito,et al.  Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection , 1997 .

[35]  Andrew W. Moore,et al.  An Investigation of Practical Approximate Nearest Neighbor Algorithms , 2004, NIPS.

[36]  Antonio Torralba,et al.  Spectral Hashing , 2008, NIPS.

[37]  Raphaël Troncy,et al.  Linking events with media , 2010, I-SEMANTICS '10.

[38]  Johannes Gehrke,et al.  ATLAS: a probabilistic algorithm for high dimensional similarity search , 2011, SIGMOD '11.

[39]  Cordelia Schmid,et al.  Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search , 2008, ECCV.

[40]  Andrew Zisserman,et al.  Object Mining Using a Matching Graph on Very Large Image Collections , 2008, 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing.