Paper Information - One Permutation Hashing

One Permutation Hashing

Minwise hashing is a standard procedure in the context of search, for efficiently estimating set similarities in massive binary data such as text. Recently, b-bit minwise hashing has been applied to large-scale learning and sublinear time near-neighbor search. The major drawback of minwise hashing is the expensive preprocessing, as the method requires applying, e.g., k = 200 to 500 permutations on the data. This paper presents a simple solution called one permutation hashing. Conceptually, given a binary data matrix, we permute the columns once and divide the permuted columns evenly into k bins; and we store, for each data vector, the smallest nonzero location in each bin. The probability analysis illustrates that this one permutation scheme should perform similarly to the original (k-permutation) minwise hashing. Our experiments with training SVM and logistic regression confirm that one permutation hashing can achieve similar (or even better) accuracies compared to the k-permutation scheme. See more details in arXiv:1208.1259.
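
The scheme described in the abstract (permute the columns once, split them evenly into k bins, keep the smallest nonzero location per bin) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function names, the choice to store each location as an offset within its bin, the use of `None` for empty bins, and the requirement that k divides the dimension are all assumptions made here. The similarity estimate (matching bins divided by bins that are non-empty in at least one vector) is likewise an assumed estimator, since the abstract does not state one.

```python
import random


def make_permutation(dim, seed=0):
    """Return one random permutation of the column indices 0..dim-1."""
    perm = list(range(dim))
    random.Random(seed).shuffle(perm)
    return perm


def one_permutation_hash(nonzeros, dim, k, perm):
    """Sketch a binary vector given by the indices of its nonzero columns.

    The permuted columns are split evenly into k bins. For each bin we keep
    the smallest permuted location, stored as an offset within the bin, or
    None when the bin holds no nonzero (an assumption of this sketch).
    """
    if dim % k != 0:
        raise ValueError("this sketch assumes k divides dim")
    bin_size = dim // k
    sketch = [None] * k
    for idx in nonzeros:
        bin_id, offset = divmod(perm[idx], bin_size)
        if sketch[bin_id] is None or offset < sketch[bin_id]:
            sketch[bin_id] = offset
    return sketch


def estimate_resemblance(sketch_a, sketch_b):
    """Fraction of matching bins among bins non-empty in at least one sketch."""
    matches = 0
    jointly_empty = 0
    for a, b in zip(sketch_a, sketch_b):
        if a is None and b is None:
            jointly_empty += 1
        elif a == b:
            matches += 1
    usable = len(sketch_a) - jointly_empty
    return matches / usable if usable else 0.0


if __name__ == "__main__":
    dim, k = 1024, 64
    perm = make_permutation(dim, seed=42)
    doc_a = set(range(0, 300))
    doc_b = set(range(100, 400))
    s_a = one_permutation_hash(doc_a, dim, k, perm)
    s_b = one_permutation_hash(doc_b, dim, k, perm)
    print("estimated resemblance:", estimate_resemblance(s_a, s_b))
    print("exact resemblance:", len(doc_a & doc_b) / len(doc_a | doc_b))
```

Each vector is scanned once and hashed with a single permutation, which is the preprocessing saving the abstract contrasts with applying k separate permutations.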
