Bucket-size balancing locality sensitive hashing using the map reduce paradigm

Similarity search is an essential operation in domains such as data mining and content-based information retrieval. Simple as it is, the operation becomes a considerable burden as the number of data records grows, especially in big data applications. Approximate methods trade some accuracy for the ability to find similar items in a reasonable amount of time. Locality sensitive hashing is a class of efficient approximate similarity search techniques. The various algorithms proposed for it all try to narrow down the candidate set of data to be examined. Because the candidate set does not always contain all the data similar to a query, the search results are approximate. Enlarging the candidate set improves recall but slows processing. This paper is concerned with a method that increases recall without incurring significant cost. The method uses random hyperplane partitioning to create buckets into which data objects are distributed. When only the bucket containing the query is examined, nearest neighbors located on the other side of a hyperplane can become false negatives. The proposed method therefore extends each hyperplane to cover its vicinity, so that data objects near a hyperplane are treated as belonging to both sides simultaneously. Over-sized buckets are further split by adding hyperplanes to control bucket sizes. To improve processing speed, the algorithm is implemented in the MapReduce paradigm on a Hadoop cluster. Experimental results are presented to show its applicability.
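The bucket construction described in the abstract can be illustrated with a small in-memory sketch. It is not the paper's implementation: the names `bucket_keys`, `build_index`, `split_oversized`, `candidates`, `margin`, and `max_size` are hypothetical, the vicinity test `|w·x| <= margin` is an assumed un-normalised distance, and oversized buckets are split only once. In a MapReduce setting, the loop in `build_index` would roughly correspond to a map step emitting (bucket key, object id) pairs and a reduce step grouping them by key.

```python
import random
from collections import defaultdict


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bucket_keys(point, planes, margin):
    """Return every bucket key the point belongs to.

    A point with |w.x| <= margin is treated as lying on both sides of
    hyperplane w, so it contributes both bits for that hyperplane.
    """
    keys = [""]
    for w in planes:
        s = dot(w, point)
        if abs(s) <= margin:
            bits = "01"
        else:
            bits = "1" if s > 0 else "0"
        keys = [k + b for k in keys for b in bits]
    return keys


def query_key(point, planes):
    """A query is hashed to exactly one bucket (no margin)."""
    return "".join("1" if dot(w, point) > 0 else "0" for w in planes)


def build_index(points, planes, margin):
    """Group point ids by bucket key (map: emit pairs, reduce: group)."""
    index = defaultdict(list)
    for i, p in enumerate(points):
        for key in bucket_keys(p, planes, margin):
            index[key].append(i)
    return index


def split_oversized(index, points, max_size, margin, rng):
    """Split each bucket larger than max_size once with one extra hyperplane.

    Returns a dict mapping each split bucket key to its extra hyperplane.
    """
    dim = len(points[0])
    extra = {}
    for key in [k for k, ids in index.items() if len(ids) > max_size]:
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        extra[key] = w
        for i in index.pop(key):
            for bit in bucket_keys(points[i], [w], margin):
                index[key + bit].append(i)
    return extra


def candidates(query, planes, index, extra):
    """Look up the single bucket the query falls into."""
    key = query_key(query, planes)
    if key in extra:
        key += query_key(query, [extra[key]])
    return index.get(key, [])


if __name__ == "__main__":
    rng = random.Random(0)
    pts = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]
    planes = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
    idx = build_index(pts, planes, margin=0.3)
    extra = split_oversized(idx, pts, max_size=30, margin=0.3, rng=rng)
    print(len(idx), "buckets,", len(extra), "split")
    print(len(candidates(pts[0], planes, idx, extra)), "candidates for point 0")
```

A larger `margin` duplicates more objects across buckets, which raises recall but also bucket sizes; the splitting step is what keeps those sizes bounded in this sketch.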
