One Permutation Hashing

Minwise hashing is a standard technique in search for efficiently estimating set similarities in massive binary data such as text. Recently, b-bit minwise hashing has been applied to large-scale learning and sublinear-time near-neighbor search. The major drawback of minwise hashing is its expensive preprocessing: the method requires applying many permutations (e.g., k = 200 to 500) to the data. This paper presents a simple solution called one permutation hashing. Conceptually, given a binary data matrix, we permute the columns once and divide the permuted columns evenly into k bins; for each data vector, we store the smallest nonzero location in each bin. The probability analysis illustrates that this one permutation scheme should perform similarly to the original (k-permutation) minwise hashing. Our experiments with training SVM and logistic regression confirm that one permutation hashing can achieve similar (or even better) accuracies compared to the k-permutation scheme. See more details in arXiv:1208.1259.
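The scheme described in the abstract (permute once, split the permuted columns into k bins, keep the smallest nonzero location per bin) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `one_permutation_hash` and `estimate_resemblance` are hypothetical, the dimension is assumed to be divisible by k, empty bins are marked with `None`, and the similarity estimator (matching bins divided by bins that are non-empty in at least one vector) is an assumption of this sketch rather than something stated in the abstract.

```python
import random


def one_permutation_hash(nonzeros, dim, k, perm):
    """Sketch a binary vector given by the indices of its nonzero entries.

    perm[i] is the permuted location of column i. The permuted range
    [0, dim) is split evenly into k bins; for each bin we keep the smallest
    nonzero location, stored as an offset within the bin. Empty bins stay
    None (an assumption of this sketch).
    """
    assert dim % k == 0, "this sketch assumes k divides dim evenly"
    bin_size = dim // k
    sketch = [None] * k
    for idx in nonzeros:
        bin_id, offset = divmod(perm[idx], bin_size)
        if sketch[bin_id] is None or offset < sketch[bin_id]:
            sketch[bin_id] = offset
    return sketch


def estimate_resemblance(sketch_a, sketch_b):
    """Fraction of matching bins among bins non-empty in at least one sketch."""
    matches = 0
    usable = 0
    for a, b in zip(sketch_a, sketch_b):
        if a is None and b is None:
            continue  # jointly empty bin carries no information
        usable += 1
        if a == b:
            matches += 1
    return matches / usable if usable else 0.0


# Usage: one shared random permutation serves every data vector.
dim, k = 16, 4
rng = random.Random(0)
perm = list(range(dim))
rng.shuffle(perm)

set_a = {0, 3, 5, 9, 12}
set_b = {0, 3, 5, 10, 12}
sk_a = one_permutation_hash(set_a, dim, k, perm)
sk_b = one_permutation_hash(set_b, dim, k, perm)
print(sk_a, sk_b, estimate_resemblance(sk_a, sk_b))
```

The key cost difference from k-permutation minwise hashing is visible in `one_permutation_hash`: each nonzero entry is touched once, instead of once per permutation.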
