Min-Max Hash for Jaccard Similarity

Min-wise hash is a widely used hashing method for scalable similarity search under Jaccard similarity. In practice, however, many such hash functions must be computed to reach a given precision, which makes hashing computationally expensive. In this paper, we introduce the min-max hash method, which halves the hashing time while provably achieving a slightly smaller variance in estimating pairwise Jaccard similarity. In addition, the min-max hash estimator involves only pairwise equality checks, which makes it especially suitable for approximate nearest neighbor search. Because min-max hash is as simple as min-wise hash, many extensions of min-wise hash can be easily adapted to it, and we show how to combine it with b-bit minwise hash. Experiments show that, for the same hash code length, min-max hash takes half the hashing time of min-wise hash while achieving a smaller mean squared error (MSE) in estimating pairwise Jaccard similarity and a better best approximate ratio (BAR) in approximate nearest neighbor search.
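
The abstract does not spell out the construction, so the following Python sketch is an illustration under stated assumptions rather than the paper's implementation. It assumes that each hash function contributes two values per set, the minimum and the maximum hashed element, so that a code of length K needs only K/2 hash functions, and that the estimator is the fraction of positions at which two codes are equal. The linear hash family, the prime modulus, and the names `make_hash_params`, `min_max_sketch`, and `estimate_jaccard` are hypothetical choices made for this example.

```python
import random

# Assumption: a simple linear hash family h(x) = (a*x + b) mod PRIME stands in
# for the random permutations; the abstract does not specify a hash family.
PRIME = (1 << 61) - 1


def make_hash_params(num_funcs, seed=0):
    """Draw (a, b) coefficients for num_funcs hypothetical hash functions."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_funcs)]


def min_max_sketch(items, params):
    """Build a code of length 2 * len(params) from a non-empty set of
    non-negative integers smaller than PRIME.

    Assumption: each hash function yields both its minimum and its maximum
    hashed value, so one pass over the set produces two code entries.
    """
    sketch = []
    for a, b in params:
        hashed = [(a * x + b) % PRIME for x in items]
        sketch.append(min(hashed))
        sketch.append(max(hashed))
    return sketch


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of equal code positions."""
    matches = sum(1 for u, v in zip(sketch_a, sketch_b) if u == v)
    return matches / len(sketch_a)


if __name__ == "__main__":
    params = make_hash_params(64, seed=1)   # 64 functions give 128 code entries
    set_a = set(range(0, 100))
    set_b = set(range(50, 150))             # true Jaccard similarity is 1/3
    code_a = min_max_sketch(set_a, params)
    code_b = min_max_sketch(set_b, params)
    print(len(code_a), estimate_jaccard(code_a, code_b))
```

Because the estimator only compares code entries for equality, the same codes could in principle be bucketed for approximate nearest neighbor search; that step is not shown here.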
