Consistent Weighted Sampling

We describe an efficient procedure for sampling representatives from a weighted set such that, for any weightings $S$ and $T$, the probability that the two choose the same sample is their Jaccard similarity: $\Pr[\mathrm{sample}(S) = \mathrm{sample}(T)] = \frac{\sum_x \min(S(x), T(x))}{\sum_x \max(S(x), T(x))}$. The sampling process takes expected time linear in the number of non-zero weights, independent of the weights themselves. We discuss and develop the implementation of our sampling schemes, substantially reducing the requisite computation and reducing the randomness required to only four bits in expectation.
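To make the consistency property concrete, here is a minimal Python sketch. It is not the procedure this paper develops: it is a naive baseline that assumes non-negative integer weights, expands each element x into the pairs (x, 1), ..., (x, S(x)), and returns the pair with the smallest hash. Its running time therefore grows with the weights, which is exactly the dependence the abstract says the paper's scheme avoids. The names `weighted_jaccard`, `naive_sample`, and `estimate_similarity` are hypothetical and chosen only for this illustration.

```python
import hashlib


def weighted_jaccard(s, t):
    """Weighted Jaccard similarity: sum of minima over sum of maxima."""
    keys = set(s) | set(t)
    num = sum(min(s.get(x, 0), t.get(x, 0)) for x in keys)
    den = sum(max(s.get(x, 0), t.get(x, 0)) for x in keys)
    return num / den if den else 0.0


def _hash(seed, x, i):
    """Deterministic 64-bit hash of (seed, x, i); stands in for shared randomness."""
    data = f"{seed}|{x!r}|{i}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def naive_sample(weights, seed):
    """Naive consistent sample for non-negative INTEGER weights (an assumption).

    Expands element x with weight w into pairs (x, 1..w) and returns the
    pair with the smallest hash. Time is proportional to the total weight.
    """
    best = None
    for x, w in weights.items():
        for i in range(1, w + 1):
            h = _hash(seed, x, i)
            if best is None or h < best[0]:
                best = (h, x, i)
    return None if best is None else best[1:]


def estimate_similarity(s, t, trials=2000):
    """Fraction of seeds on which the two weightings pick the same sample."""
    hits = sum(naive_sample(s, k) == naive_sample(t, k) for k in range(trials))
    return hits / trials


if __name__ == "__main__":
    S = {"a": 2, "b": 1}
    T = {"a": 1, "c": 1}
    print(weighted_jaccard(S, T))     # exact similarity
    print(estimate_similarity(S, T))  # empirical collision rate
```

Two weightings agree on a seed exactly when the smallest-hash pair in the union of their expansions lies in both expansions. The expansions share the sum of minima many pairs out of the sum of maxima in their union, so the collision rate should approach the weighted Jaccard similarity as the number of trials grows.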
