Extended Min-Hash Focusing on Intersection Cardinality

Min-Hash is a well-known hashing technique for set similarity search. It assumes the Jaccard similarity \(\frac{|A\cap B|}{|A\cup B|}\) as the similarity measure between two sets \(A\) and \(B\). Consequently, Min-Hash is not well suited to applications that measure set similarity by the intersection cardinality \(|A\cap B|\), because the Jaccard similarity decreases as the gap between \(|A|\) and \(|B|\) grows, regardless of \(|A\cap B|\). This paper shows that a slight modification of Min-Hash effectively resolves this difficulty. The validity of our method is supported by both theoretical analysis and experiments.
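
To make the motivation concrete, the following minimal Python sketch illustrates standard Min-Hash only, not the modified method proposed in this paper. It shows that the Jaccard similarity shrinks as \(|B|\) grows while \(|A\cap B|\) stays fixed. The function names (`jaccard`, `make_hashes`, `signature`, `estimate_jaccard`, `growing_gap_demo`), the linear hash family modulo a prime, and the integer-valued set elements are all assumptions made for illustration.

```python
import random

# Assumption: set elements are non-negative integers smaller than PRIME.
PRIME = (1 << 61) - 1  # a Mersenne prime used as the hash modulus


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| (0.0 if both sets are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def make_hashes(k, seed=0):
    """Draw k illustrative hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def signature(s, hashes):
    """Standard Min-Hash signature of a non-empty set: one minimum per hash function."""
    return [min((a * x + b) % PRIME for x in s) for a, b in hashes]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates the Jaccard similarity."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def growing_gap_demo(common=100, extras=(0, 100, 900)):
    """Keep |A ∩ B| fixed at `common` while |B| grows; return (|A ∩ B|, |B|, Jaccard)."""
    a = set(range(common))
    rows = []
    for n in extras:
        b = a | set(range(10_000, 10_000 + n))  # B contains A plus n extra elements
        rows.append((len(a & b), len(b), jaccard(a, b)))
    return rows


if __name__ == "__main__":
    for inter, size_b, jac in growing_gap_demo():
        print(f"|A∩B|={inter}  |B|={size_b}  Jaccard={jac}")
```

In this sketch the intersection cardinality is 100 in every row, yet the exact Jaccard similarity falls from 1.0 to 0.5 to 0.1 as \(|B|\) grows from 100 to 200 to 1000. An estimator built on matching Min-Hash values, such as `estimate_jaccard` above, targets that shrinking quantity rather than \(|A\cap B|\).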