Paper information - b-Bit minwise hashing

b-Bit minwise hashing

This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance) with applications in information retrieval, data management, computational advertising, etc. By only storing b bits of each hashed value (e.g., b=1 or 2), we gain substantial advantages in terms of storage space. We prove the basic theoretical results and provide an unbiased estimator of the resemblance for any b. We demonstrate that, even in the least favorable scenario, using b=1 may reduce the storage space at least by a factor of 21.3 (or 10.7) compared to b=64 (or b=32), if one is interested in resemblance >0.5.
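The idea in the abstract, keeping only the lowest b bits of each minwise hash and then correcting for accidental collisions, can be sketched as follows. This is a minimal illustration, not the authors' implementation. It makes several assumptions that are not in the text above: a seeded BLAKE2b hash (`hash64`, a hypothetical helper) stands in for a random permutation, set elements are non-negative integer IDs, and the sets are sparse relative to the hash space, so two different minima agree on their low b bits with probability of about 2^-b. Under that simplification the match probability is roughly 2^-b + (1 - 2^-b) R, which is inverted to estimate the resemblance R.

```python
import hashlib


def hash64(item: int, seed: int) -> int:
    """Hypothetical stand-in for one random permutation: a seeded 64-bit hash."""
    digest = hashlib.blake2b(
        item.to_bytes(8, "little"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def b_bit_signature(items, k: int, b: int) -> list:
    """For each of k hash functions, keep only the lowest b bits of the minimum hash."""
    mask = (1 << b) - 1
    return [min(hash64(x, seed) for x in items) & mask for seed in range(k)]


def estimate_resemblance(sig1, sig2, b: int) -> float:
    """Invert P_b ~ 2^-b + (1 - 2^-b) * R (sparse-data assumption) to estimate R."""
    k = len(sig1)
    p_hat = sum(u == v for u, v in zip(sig1, sig2)) / k  # observed match rate
    c = 1.0 / (1 << b)  # chance that two different minima share their low b bits
    return (p_hat - c) / (1.0 - c)


if __name__ == "__main__":
    a = set(range(0, 300))
    b_set = set(range(150, 450))  # true resemblance = 150 / 450 = 1/3
    sig_a = b_bit_signature(a, k=1000, b=2)
    sig_b = b_bit_signature(b_set, k=1000, b=2)
    print(estimate_resemblance(sig_a, sig_b, b=2))
```

Each signature here costs k × b bits instead of k × 64. The factors 21.3 and 10.7 quoted in the abstract are consistent with 64/3 and 32/3, that is, with b=1 needing roughly three times as many samples as full 64-bit (or 32-bit) values to reach the same accuracy in the least favorable case. That reading is an inference from the numbers rather than something the abstract states.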

Ping Li | Arnd Christian König
